split on last occurrence of digit, take 2nd part - regex

If I have a string and want to split on the last digit and keep the last part of the split, how can I do that?
x <- c("ID", paste0("X", 1:10, state.name[1:10]))
I'd like
[1] NA "Alabama" "Alaska" "Arizona" "Arkansas"
[6] "California" "Colorado" "Connecticut" "Delaware" "Florida"
[11] "Georgia"
But would settle for:
[1] "ID" "Alabama" "Alaska" "Arizona" "Arkansas"
[6] "California" "Colorado" "Connecticut" "Delaware" "Florida"
[11] "Georgia"
I can get the first part by:
unlist(strsplit(x, "[^0-9]*$"))
But want the second part.
Thank you in advance.

You can do this in one easy step with a regular expression:
gsub("(^.*\\d+)(\\w*)", "\\2", x)
Results in:
[1] "ID" "Alabama" "Alaska" "Arizona" "Arkansas" "California" "Colorado" "Connecticut"
[9] "Delaware" "Florida" "Georgia"
What the regex does:
"(^.*\\d+)(\\w*)": Look for two groups of characters.
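The first group (^.*\\d+) matches any characters from the start of the string, followed by at least one digit. Because .* is greedy, this group extends through the last digit in the string.
The second group (\\w*) matches zero or more word characters (letters, digits or underscore) after that digit.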
The "\\2" as the second argument to gsub() means to replace the original string with the second group that the regex found.
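As a rough sketch of the same idea that also yields NA for strings with no digit (the first output asked for in the question), one could anchor the pattern on the last digit and mask the non-matching elements afterwards. This uses only base R; the perl = TRUE setting and the name tail_part are my own choices here, not something the answer above requires.

```r
x <- c("ID", paste0("X", 1:10, state.name[1:10]))

# Greedy ".*" backtracks only as far as the last digit, so the
# captured "\\D*" is everything that follows that digit.
tail_part <- sub("^.*\\d(\\D*)$", "\\1", x, perl = TRUE)

# sub() returns non-matching strings unchanged, so set those to NA explicitly.
tail_part[!grepl("[0-9]", x)] <- NA
tail_part
```

The masking step is what turns "ID" into NA; dropping it gives the "would settle for" output instead.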

library(stringr)
unlist(lapply(str_split(x, "[0-9]"), tail,n=1))
gives
[1] "ID" "Alabama" "Alaska" "Arizona" "Arkansas" "California" "Colorado" "Connecticut" "Delaware"
[10] "Florida" "Georgia"
I would look at the stringr documentation for (quite possibly) an even better approach.

This seems a bit clunky, but it works:
state.pt2 <- unlist(strsplit(x,"^.[0-9]+"))
state.pt2[state.pt2!=""]
It would be nice to remove the ""'s generated by the match at the start of the string but I can't figure that out.
Here's another method using substr and gregexpr too that avoids having to subset the results:
substr(x,unlist(lapply(gregexpr("[0-9]",x),max))+1,nchar(x))
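A small variation on the substr/gregexpr idea, sketched under the assumption that strings without any digit should become NA rather than be returned whole. The variable names last_digit and res are just for illustration.

```r
x <- c("ID", "X1Alabama", "X10Georgia")

# Position of the last digit in each string; gregexpr() gives -1 when there is none.
last_digit <- sapply(gregexpr("[0-9]", x), max)

# Keep everything after that position, or NA when no digit was found.
res <- ifelse(last_digit < 0, NA, substr(x, last_digit + 1, nchar(x)))
res
```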

gsubfn
Try this gsubfn solution:
> library(gsubfn)
> strapply(x, ".*\\d(\\w*)|$", ~ if (nchar(z)) z else NA, simplify = TRUE)
[1] NA "Alabama" "Alaska" "Arizona" "Arkansas"
[6] "California" "Colorado" "Connecticut" "Delaware" "Florida"
[11] "Georgia"
It matches the last digit followed by word characters and returns the word characters or if that fails it matches the end of line (to ensure that it matches something). If the first match succeeded then return it; otherwise, the back reference will be empty so return NA.
Note that the formula is a shorthand way of writing the function function(z) if (nchar(z)) z else NA, and that function could alternatively replace the formula at the expense of slightly more keystrokes.
gsub
A similar strategy could also work using just straight gsub but requires two lines and a marginally more complex regular expression. Here we use the second alternative to slurp up non-matches from the first alternative:
> s <- gsub(".*\\d(\\w*)|.*", "\\1", x)
> ifelse(nchar(s), s, NA)
[1] NA "Alabama" "Alaska" "Arizona" "Arkansas"
[6] "California" "Colorado" "Connecticut" "Delaware" "Florida"
[11] "Georgia"
EDIT: minor improvements

Related

Matching special character in R

Hi I have the following data.
shopping_list <- c("apples x4", "bag of flour", "bag of sugar", "milk x2",
"appple+20gfree",
"BELI HG MSWAT ALA +VAT T 100g BAR WR",
"TOOLAIT CASSE+LSST+SSSRE 40g SAC MDC")
In my second step I remove all whitespace in shopping_list.
require(stringr)
shopping_list_trim <- str_replace_all(shopping_list, fixed(" "), "")
print(shopping_list_trim)
[1] "applesx4" "bagofflour" "bagofsugar"
[4] "milkx2" "appple+20gfree" "BELIHGMSWATALA+VATT100gBARWR"
[7] "TOOLAITCASSE+LSST+SSSRE40gSACMDC"
If I want to extract the strings that do not contain a plus sign, I use the following code.
str_extract(shopping_list_trim, "^[^+]+$")
[1] "applesx4" "bagofflour" "bagofsugar" "milkx2" NA NA NA
I would like help extracting the strings that do contain a plus sign.
I would like the output to be the following:
NA NA NA NA "appple+20gfree"
"BELIHGMSWATALA+VATT100gBARWR" "TOOLAITCASSE+LSST+SSSRE40gSACMDC"
Does anybody have an idea how to extract only the strings that contain a plus sign?
This will do the trick
> str_extract(shopping_list_trim, "^(?=.*\\+)(.+)$")
[1] NA
[2] NA
[3] NA
[4] NA
[5] "appple+20gfree"
[6] "BELIHGMSWATALA+VATT100gBARWR"
[7] "TOOLAITCASSE+LSST+SSSRE40gSACMDC"
Regex Breakdown
^(?=.*\\+) #Lookahead to check if there is one plus sign
(.+)$ #Capture the string if the above is true
If you can't or don't want to use look-arounds, try
^.*\+.*$
(written as "^.*\\+.*$" inside an R string). It matches anything followed by a + followed by anything :)
See it work here at regex101.
Regards
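If no regular expression is needed at all, a plain fixed-string test may be enough. The following is only a sketch in base R (no stringr), using a shortened version of the trimmed list from the question; has_plus and with_plus are illustrative names.

```r
shopping_list_trim <- c("applesx4", "bagofflour", "appple+20gfree",
                        "TOOLAITCASSE+LSST+SSSRE40gSACMDC")

# fixed = TRUE treats "+" as a literal character, so no escaping is required.
has_plus <- grepl("+", shopping_list_trim, fixed = TRUE)

# Keep the strings that contain a plus sign, NA elsewhere.
with_plus <- ifelse(has_plus, shopping_list_trim, NA)
with_plus
```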

Detecting sequencing using regexes

Imagine I have multiple character strings in a list like this:
[[1]]
[1] "1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-"
[2] "-1-I2-1-TR-1-"
[3] "-1-I2-1-FA-1-I3-1-"
[4] "-1-FA-1-FA-1-NR-1-I3-1-I2-1-TR-1-"
[5] "-1-I2-1-"
[6] "-1-I2-1-FA-1-I2-1-"
[7] "-1-I3-1-FA-1-QU-1-"
[8] "-1-I2-1-I2-1-I2-1-NR-1-I2-1-I2-1-NR-1-"
[9] "-1-I2-1-"
[10] "-1-NR-1-I3-1-QU-1-I2-1-I3-1-QU-1-NR-1-I2-1-"
[11] "-1-NR-1-QU-1-QU-1-I2-1-"
I want to use a regex to detect the particular strings where a certain substring precedes another substring, but not necessarily directly preceding the other substring.
For example, let's say that we are looking for FA preceding EX. This would need to match element 1 in the list. Even though FA has -1-I2-1-I2-1-I2-1- between itself and EX, the FA still occurs before the EX, hence a match is expected.
How can a generic regex be defined that identifies strings where certain substrings appear before another substring in this manner?
You may use grep.
x <- c("1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-" ,"-1-I2-1-TR-1-")
grepl("FA.*EX", x)
#[1] TRUE FALSE
grep("FA.*EX", x)
#[1] 1
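To make this generic, the pattern can be built from the two substrings. The helper below is a hypothetical sketch (the name precedes is mine), and it assumes the substrings are plain codes such as FA or EX with no regex metacharacters in them.

```r
# Hypothetical helper: TRUE where `first` occurs somewhere before `second`.
# Assumes both arguments contain no regex metacharacters.
precedes <- function(x, first, second) {
  grepl(paste0(first, ".*", second), x)
}

x <- c("1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-",
       "-1-I2-1-TR-1-",
       "-1-EX-1-FA-1-")
hits <- precedes(x, "FA", "EX")
hits
```

The third string contains both codes but in the wrong order, so it should not count as a match.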

Cleaning up dates (years, specifically) with regex

I have a database with a non-validated year field. Most of the entries are 4-digit years but about 10% of the entries are "whatever." This has led me down the rabbit hole of regular expressions to little avail. Getting better results than what I have is progress, even if I don't extract 100%.
#what a mess
yearEntries <- c("79, 80, 99","07-26-08","07-26-2008","'96 ","Early 70's","93/95","late 70's","15","late 60s","Late 70's",NA,"2013","1992-1993")
#does a good job with any string containing a 4-digit year
as.numeric(sub('\\D*(\\d{4}).*', '\\1', yearEntries))
#does a good job with any string containing a 2-digit year, nought else
as.numeric(sub('\\D*(\\d{2}).*', '\\1', yearEntries))
The desired output is to grab the first readable year, so 1992-1993 would be 1992 and "the 70s" would be 1970.
How can I improve my parsing accuracy? Thanks!
EDIT: Pursuant to garyh's answer this gets me much closer:
sub("\\D*((?<!\\d)\\d{2}(?!\\-|\\d)|\\d{4}).*","\\1",yearEntries,perl=TRUE)
# [1] "79" "07-2608" "07-262008" "96" "70" "93" "70" "15" "60" "70" NA "2013" "1992"
but note that, while the dates with dashes in them work with garyh's regex101.com demo, they don't work in R, which keeps the month and day values and the first dash.
Also, I realize I didn't include an example date with slashes rather than dashes. Another term in the regex should handle that but again, in R, it does not produce the same (correct) result that regex101.com does.
sub("\\D*((?<!\\d)\\d{2}(?!\\-|\\/|\\d)|\\d{4}).*","\\1","07/09/13",perl=TRUE)
# [1] "07/0913"
These negative lookbehinds and lookaheads are very powerful but stretch my feeble brain.
Not sure what flavour of regex R uses but this seems to get all the years in the string
/((?<!\d)\d{2}(?!\-|\d)|\d{4})/g
This is matching any 4 digits or any 2 digits provided they're not followed by a dash - or digit, or preceded by another digit
see demo here
You're going to need some elbow grease and do something like:
library(lubridate)
yearEntries <- c("79, 80, 99","07-26-08","07-26-2008","'96 ","Early 70's","93/95","late 70's","15","late 60s","Late 70's",NA,"2013","1992-1993")
x <- yearEntries
x <- gsub("(late|early)", "", x, ignore.case=TRUE)
x <- gsub("[']*[s]*", "", x)
x <- gsub(",.*$", "", x)
x <- gsub(" ", "", x)
x <- ifelse(nchar(x)==9 | nchar(x)<8, gsub("[-/]+[[:digit:]]+$", "", x), x)
x <- ifelse(nchar(x)==4, gsub("^[[:digit:]]{2}", "", x), x)
y <- format(parse_date_time(x, "%m-%d-%y!"), "%y")
yearEntries <-ifelse(!is.na(y), y, x)
yearEntries
## [1] "79" "08" "08" "96" "70" "93" "70" "15" "60" "70" NA "13" "92"
We have no idea which year you need from ranged entries, but this should get you started.
I found a very simple way to get a good result (though I would not claim it is bullet proof). It grabs the last readable year, which is okay too.
yearEntries <- c("79, 80, 99","07/26/08","07-26-2008","'96 ","Early 70's","93/95","15",NA,"2013","1992-1993","ongoing")
# assume last two digits present in any string represent a 2-digit year
a<-sub(".*(\\d{2}).*$","\\1",yearEntries)
# [1] "99" "08" "08" "96" "70" "95" "15" NA "13" "93" "ongoing"
# change to numeric, strip NAs and add 2000
b<-na.omit(as.numeric(a))+2000
# [1] 2099 2008 2008 2096 2070 2095 2015 2013 2093
# assume any greater than present is last century
b[b>2015]<-b[b>2015]-100
# [1] 1999 2008 2008 1996 1970 1995 2015 2013 1993
...and Bob's your uncle!
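The century step above can be wrapped in a small function. This is only a sketch: the name expand_year and the current_year argument are made up for illustration, with the default of 2015 taken from the cutoff used above.

```r
# Hypothetical helper: turn 2-digit years into 4-digit years.
# Anything that would land after `current_year` is assumed to be last century.
expand_year <- function(yy, current_year = 2015) {
  yyyy <- 2000 + yy
  ifelse(yyyy > current_year, yyyy - 100, yyyy)
}

years <- expand_year(c(99, 8, 70, 15, 13))
years
```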
garyh's regex actually works well if you use the regmatches/gregexpr combo to extract the pattern instead of sub:
regmatches(yearEntries,
gregexpr("(?<!\\d)\\d{2}(?!-|\\/|\\d)|\\d{4}",yearEntries,perl=TRUE))
[[1]]
[1] "79" "80" "99"
[[2]]
[1] "08"
[[3]]
[1] "2008"
[[4]]
[1] "96"
[[5]]
[1] "70"
[[6]]
[1] "95"
[[7]]
[1] "70"
[[8]]
[1] "15"
[[9]]
[1] "60"
[[10]]
[1] "70"
[[11]]
character(0)
[[12]]
[1] "2013"
[[13]]
[1] "1992" "1993"
To only keep the first matching pattern:
sapply(regmatches(yearEntries,gregexpr("(?<!\\d)\\d{2}(?!-|\\/|\\d)|\\d{4}",yearEntries,perl=TRUE)),`[`,1)
[1] "79" "08" "2008" "96" "70" "95" "70" "15" "60" "70" NA "2013" "1992"
regmatches("07/09/13",gregexpr("(?<!\\d)\\d{2}(?!-|\\/|\\d)|\\d{4}","07/09/13",perl=TRUE))
[[1]]
[1] "13"

difference between [0-9] n times and [0-9]{n} in R regexp

To the best of my knowledge both are supposed to be the same, but I actually see a difference. Look at this minimal example from this question:
a<-c("/Cajon_Criolla_20141024","/Linon_20141115_20141130",
"/Cat/LIQUID",
"/c_puertas_20141206_20141107",
"/C_Puertas_3_20141017_20141018",
"/c_puertas_navidad_20141204_20141205")
sub("(.*?)_([0-9]{8})(.*)$","\\2",a)
[1] "20141024" "20141130" "/Cat/LIQUID" "20141107" "20141018"
[6] "20141205"
sub("(.*?)_([0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])(.*)$","\\2",a)
[1] "20141024" "20141115" "/Cat/LIQUID" "20141206" "20141017"
[6] "20141204"
What am I missing? Or is this a bug?
This is a bug in the TRE library related to greedy modifiers and capture groups. See:
SO question with similar issue
Issue #11 on TRE repo
Issue #21.
Setting perl=TRUE gives the same answer (as expected) for both expressions:
> sub("(.*?)_([0-9]{8})(.*)$","\\2",a,perl=TRUE)
[1] "20141024" "20141115" "/Cat/LIQUID" "20141206" "20141017" "20141204"
> sub("(.*?)_([0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])(.*)$","\\2",a,perl=TRUE)
[1] "20141024" "20141115" "/Cat/LIQUID" "20141206" "20141017" "20141204"
Though I was initially convinced by BrodieG's answer, it seems that [0-9] n times and [0-9]{n} are indeed different, at least for the "tre" regex engine. According to http://www.regular-expressions.info the {} operator is greedy, [0-9] is not.
Hence the right regular expression in my case should be:
sub("(.*?)_([0-9]{8}?)(.*)$","\\2",a)
Making all the difference:
sub("(.*?)_([0-9]{8})(.*)$","\\2",a)
[1] "20141024" "20141130" "/Cat/LIQUID" "20141107" "20141018"
[6] "20141205"
sub("(.*?)_([0-9]{8}?)(.*)$","\\2",a)
[1] "20141024" "20141115" "/Cat/LIQUID" "20141206" "20141017"
[6] "20141204"
And even
> sub("(.*)_([0-9]{8}?)(.*)$","\\2",a)
[1] "20141024" "20141115" "/Cat/LIQUID" "20141206" "20141017"
[6] "20141204"
Interpretation: 1) tre treats ? as "evaluate the next atom the first time you can match this atom". This is always true for ".*?", as everything matches, so it switches to _[0-9]{8}. When it reaches the first group of 8 digits, if {} is left unmodified (no ?), then since (.*) also matches the first 8 digits, the search continues to see whether another occurrence of "_[0-9]{8}" can be found on the line. On meeting the second set of 8 digits, it also records it as a matching pattern; then it reaches the end of the line, the last matching pattern is kept, and [0-9]{8} is matched to the second set of 8 digits.
2) When the {} operator is modified by ?, the search stops the first time it sees 8 digits and checks whether _(.*) can be matched to the rest. It can, so it returns the first set of 8 digits.
Note that the perl regex engine works differently:
1) ? after {} doesn't change a thing:
sub("(.*)_([0-9]{8})","\\2",a,perl=TRUE)
[1] "20141024" "20141130" "/Cat/LIQUID" "20141107" "20141018"
[6] "20141205"
sub("(.*)_([0-9]{8}?)","\\2",a,perl=TRUE)
[1] "20141024" "20141130" "/Cat/LIQUID" "20141107" "20141018"
[6] "20141205"
2) the ? applied to .* makes it stop at the first set of 8 figures:
sub("(.*?)_([0-9]{8}).*","\\2",a,perl=TRUE)
[1] "20141024" "20141115" "/Cat/LIQUID" "20141206" "20141017"
[6] "20141204"
sub("(.*)_([0-9]{8}).*","\\2",a,perl=TRUE)
[1] "20141024" "20141130" "/Cat/LIQUID" "20141107" "20141018"
[6] "20141205"
From these two observations, it seems that the two engines interpret differently the greediness in two different instances. I always found the greediness concept to be a bit vague ...
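For the PCRE case at least, the greedy/lazy distinction can be seen on a single string. A minimal sketch using one entry from the question's data; the variable names are mine:

```r
s <- "/Linon_20141115_20141130"

# Greedy (.*) grabs as much as it can and backtracks to the LAST "_" + 8 digits.
greedy <- sub("(.*)_([0-9]{8}).*", "\\2", s, perl = TRUE)

# Lazy (.*?) grabs as little as it can and stops at the FIRST "_" + 8 digits.
lazy <- sub("(.*?)_([0-9]{8}).*", "\\2", s, perl = TRUE)

c(greedy = greedy, lazy = lazy)
```

With perl = TRUE the quantifier on .* alone decides which date is returned, which is consistent with the outputs shown above.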

How to extract a part from a string in R

I have a problem extracting the numeric parts of a string in R. The original string is, for example, "buy 1000 shares of Google at 1100 GBP".
I need to extract the number of shares (1000) and the price (1100) separately. I also need to extract the name of the stock, which always appears after "shares of".
I know that sub and gsub can replace parts of a string, but what commands should I use to extract part of a string?
1) This extracts all numbers in order:
s <- "buy 1000 shares of Google at 1100 GBP"
library(gsubfn)
strapplyc(s, "[0-9.]+", simplify = as.numeric)
giving:
[1] 1000 1100
2) If the numbers can be in any order but if the number of shares is always followed by the word "shares" and the price is always followed by GBP then:
strapplyc(s, "(\\d+) shares", simplify = as.numeric) # 1000
strapplyc(s, "([0-9.]+) GBP", simplify = as.numeric) # 1100
The portion of the string matched by the part of the regular expression within parens is returned.
3) If the string is known to be of the form: X shares of Y at Z GBP then X, Y and Z can be extracted like this:
strapplyc(s, "(\\d+) shares of (.+) at ([0-9.]+) GBP", simplify = c)
ADDED Modified pattern to allow either digits or a dot. Also added (3) above and the following:
strapply(c(s, s), "[0-9.]+", as.numeric)
strapply(c(s, s), "[0-9.]+", as.numeric, simplify = rbind) # if ea has same no of matches
strapply(c(s, s), "(\\d+) shares", as.numeric, simplify = c)
strapply(c(s, s), "([0-9.]+) GBP", as.numeric, simplify = c)
strapplyc(c(s, s), "(\\d+) shares of (.+) at ([0-9.]+) GBP")
strapplyc(c(s, s), "(\\d+) shares of (.+) at ([0-9.]+) GBP", simplify = rbind)
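The pattern from (3) can also be tried without gsubfn. The following is a sketch using base R's regexec and regmatches, which return the full match followed by each parenthesised group; the list names shares, stock and price are only illustrative.

```r
s <- "buy 1000 shares of Google at 1100 GBP"

# regexec() records the full match plus one entry per capture group.
m <- regmatches(s, regexec("(\\d+) shares of (.+) at ([0-9.]+) GBP", s))[[1]]

# m[1] is the whole match; m[2], m[3], m[4] are the three groups.
parts <- list(shares = as.numeric(m[2]),
              stock  = m[3],
              price  = as.numeric(m[4]))
parts
```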
You can use the sub function:
s <- "buy 1000 shares of Google at 1100 GBP"
# the number of shares
sub(".* (\\d+) shares.*", "\\1", s)
# [1] "1000"
# the stock
sub(".*shares of (\\w+) .*", "\\1", s)
# [1] "Google"
# the price
sub(".* at (\\d+) .*", "\\1", s)
# [1] "1100"
You can also use gregexpr and regmatches to extract all substrings at once:
regmatches(s, gregexpr("\\d+(?= shares)|(?<=shares of )\\w+|(?<= at )\\d+",
s, perl = TRUE))
# [[1]]
# [1] "1000" "Google" "1100"
I feel compelled to include the obligatory stringr solution as well.
library(stringr)
s <- "buy 1000 shares of Google at 1100 GBP"
str_match(s, "([0-9]+) shares")[2]
[1] "1000"
str_match(s, "([0-9]+) GBP")[2]
[1] "1100"
If you want to extract all digits from text, use this function from the stringi package.
"Nd" is the class of decimal digits.
library(stringi)
stri_extract_all_charclass(c(123,43,"66ala123","kot"),"\\p{Nd}")
[[1]]
[1] "123"
[[2]]
[1] "43"
[[3]]
[1] "66" "123"
[[4]]
[1] NA
Please note that here the numbers 66 and 123 are extracted separately.
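For comparison, a roughly equivalent base R sketch (no stringi) uses gregexpr with regmatches. One difference to keep in mind: a string with no digits gives character(0) here rather than NA. The pattern [0-9]+ is my assumption that only ASCII digits matter.

```r
x <- c(123, 43, "66ala123", "kot")

# Each list element holds the runs of ASCII digits found in that string.
digits <- regmatches(x, gregexpr("[0-9]+", x))
digits
```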