Regex matching everything that's not a 4 digit number


I match and replace 4-digit numbers preceded and followed by white space with:
str12 <- "coihr 1234 &/()= jngm 34 ljd"
sub("\\s\\d{4}\\s", "", str12)
[1] "coihr&/()= jngm 34 ljd"
but every attempt to invert this and extract the number instead fails.
I want:
[1] 1234
Does anyone have a clue?
PS: I know how to do it with {stringr}, but I am wondering whether it's possible with {base} only.
require(stringr)
gsub("\\s", "", str_extract(str12, "\\s\\d{4}\\s"))
[1] "1234"

regmatches(), available since R 2.14.0, allows you to "extract or replace matched substrings from match data obtained by regexpr, gregexpr or regexec".
Here are examples of how you could use regmatches() to extract either the first whitespace-cushioned 4-digit substring in your input character string, or all such substrings.
## Example strings and pattern
x <- "coihr 1234 &/()= jngm 34 ljd" # string with 1 matching substring
xx <- "coihr 1234 &/()= jngm 3444 6789 ljd" # string with >1 matching substring
pat <- "(?<=\\s)(\\d{4})(?=\\s)"
## Use regexpr() to extract *1st* matching substring
as.numeric(regmatches(x, regexpr(pat, x, perl=TRUE)))
# [1] 1234
as.numeric(regmatches(xx, regexpr(pat, xx, perl=TRUE)))
# [1] 1234
## Use gregexpr() to extract *all* matching substrings
as.numeric(regmatches(xx, gregexpr(pat, xx, perl=TRUE))[[1]])
# [1] 1234 3444 6789
(Note that this will return numeric(0) for character strings not containing a substring matching your criteria).
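One caveat with the lookaround pattern above: it requires a whitespace character on both sides, so a 4-digit number at the very start or end of the string is not matched. Here is a minimal sketch of one alternative, assuming you want string boundaries to count like whitespace. The string y and the names pat_strict and pat_edges are made up for illustration.

```r
# Hypothetical input: 4-digit numbers at both ends, a 5-digit number in the middle
y <- "1234 abc 56789 4321"

# Pattern from the answer above: needs whitespace on both sides
pat_strict <- "(?<=\\s)(\\d{4})(?=\\s)"
strict_hits <- regmatches(y, gregexpr(pat_strict, y, perl = TRUE))[[1]]

# Alternative: "not preceded / not followed by a non-whitespace character",
# which also holds at the start and end of the string
pat_edges <- "(?<!\\S)\\d{4}(?!\\S)"
edge_hits <- regmatches(y, gregexpr(pat_edges, y, perl = TRUE))[[1]]

strict_hits  # character(0)
edge_hits    # "1234" "4321"
```

The 5-digit run is rejected by both patterns because one of its neighbours is always a digit.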

You can capture a group in a regex using (). Taking the same example:
str12 <- "coihr 1234 &/()= jngm 34 ljd"
gsub(".*\\s(\\d{4})\\s.*", "\\1", str12)
[1] "1234"
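Keep in mind that this replace-with-backreference trick returns the input unchanged when the pattern does not match at all. Here is a small sketch of a guard, assuming you would rather get NA in that case. The helper name extract4 and the string no_match are made up for illustration.

```r
str12 <- "coihr 1234 &/()= jngm 34 ljd"
no_match <- "coihr 12 ljd"          # hypothetical string without a 4-digit number
pat <- ".*\\s(\\d{4})\\s.*"

# Without a guard, a non-matching string comes back as-is
unguarded <- sub(pat, "\\1", no_match)

# Guarded version: NA where the pattern does not match
extract4 <- function(x) {
  ifelse(grepl(pat, x), sub(pat, "\\1", x), NA_character_)
}
guarded <- extract4(c(str12, no_match))

unguarded  # "coihr 12 ljd"
guarded    # "1234" NA
```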

I'm pretty naive about regex in general, but here's an ugly way to do it in base:
# if it's always in the same spot as in your example
unlist(strsplit(str12, split = " "))[2]
# or if it can occur in various places
str13 <- unlist(strsplit(str12, split = " "))
str13[!is.na(as.integer(str13)) & nchar(str13) == 4] # issues warning
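The coercion warning can be avoided by testing each token against a regex instead of calling as.integer(). This is a rough sketch of the same split-then-filter idea; the variable names tokens and four_digit are made up here.

```r
str12 <- "coihr 1234 &/()= jngm 34 ljd"

# Split on single spaces, then keep tokens made of exactly four digits
tokens <- strsplit(str12, split = " ", fixed = TRUE)[[1]]
four_digit <- tokens[grepl("^[0-9]{4}$", tokens)]

four_digit  # "1234"
```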

Related

R regmatches() and stringr str_extract() dragging whitespaces along

Here's the thing:
test=" 2 15 3 23 12 0 0.18"
#I want to extract the 1st number separately
pattern="^ *(\\d+) +"
d=regmatches(test,gregexpr(pattern,test))
> d
[[1]]
[1] " 2 "
library(stringr)
f=str_extract(test,pattern)
> f
[1] " 2 "
They both bring whitespaces to the result despite usage of ()-brackets. Why? The brackets are for specifying which part of the matched pattern you want, am I wrong? I know I can trim them with trimws() or coerce them directly to numeric, but I wonder if I misunderstand some mechanics of patterns.

Using str_match (or str_match_all)
Since you want to extract a capture group, you can use str_match (or str_match_all). str_extract only extracts whole matches.
From R stringr help:
str_match Extract matched groups from a string.
and
str_extract to extract the complete match
R code:
library(stringr)
test=" 2 15 3 23 12 0 0.18"
pattern="^ *(\\d+) +"
f=str_match(test,pattern)
f[[2]]
## [1] "2"
f[[2]] returns the second element of the match matrix, which is the value of the first capture group.
Using regmatches
As mentioned in the comments, this is also possible with regmatches and regexec:
test=" 2 15 3 23 12 0 0.18"
pattern="^ *(\\d+) +"
res <- regmatches(test,regexec(pattern,test))
res[[1]][2] # The res list contains all matches and submatches
## [1] "2" # We take item [2] of the first match to get "2"
See regexec help page that says:
regexec returns a list of the same length as text each element of which is either -1 if there is no match, or a sequence of integers with the starting positions of the match and all substrings corresponding to parenthesized subexpressions of pattern, with attribute "match.length" a vector giving the lengths of the matches (or -1 for no match).
OP task specific solution
Actually, since you are only interested in one integer at the beginning of the string, you can get what you want with a plain gsub:
> gsub("^ *(\\d+) +.*", "\\1", test)
[1] "2"
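The regmatches/regexec approach generalizes to a whole character vector. Here is a sketch of a small wrapper that returns NA where there is no match; the helper name first_group and the second test string are assumptions added for illustration.

```r
# Return the first capture group for each element of x, or NA if no match
first_group <- function(x, pattern) {
  m <- regmatches(x, regexec(pattern, x))
  vapply(m,
         function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
         character(1))
}

test <- c(" 2 15 3 23 12 0 0.18", "no leading number")
first_num <- first_group(test, "^ *(\\d+) +")

first_num  # "2" NA
```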

R get first letters of double/triple-barrel surnames in data.frame

I have a dataframe with 2 columns:
> df1
Surname Name
1 The Builder Bob
2 Zeta-Jones Catherine
I want to add a third column "Shortened_Surname" which contains the first letters of all the words in the surname field:
Surname Name Shortened_Surname
1 The Builder Bob TB
2 Zeta-Jones Catherine ZJ
Note the "-" in the second name. I have barreled surnames separated by spaces and hyphens.
I have tried:
Step1:
> strsplit(unlist(as.character(df1$Surname))," ")
[[1]]
[1] "The" "Builder"
[[2]]
[1] "Zeta-Jones"
My research suggests I could possibly use strtrim as a Step 2, but all I have found is a number of ways how not to do it.

You can target the space, the hyphen, and the beginning of the line with lookarounds. For instance, any character (.) not preceded by the beginning of the line, a space, or a hyphen should be replaced with "":
with(df1, gsub("(?<!^|[ -]).", "", Surname, perl=TRUE))
[1] "TB" "ZJ"
or
with(df1, gsub("(?<=[^ -]).", "", Surname, perl=TRUE))
The second gsub substitutes an empty string ("") for any character that is preceded by a character that is not a " " or "-".
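To make this reproducible end to end, here is a sketch that rebuilds the two-row data frame from the question and stores the result in the requested column. The data frame construction is an assumption based on the printed output in the question.

```r
# Rebuild the example data (assumed from the question's printout)
df1 <- data.frame(Surname = c("The Builder", "Zeta-Jones"),
                  Name = c("Bob", "Catherine"),
                  stringsAsFactors = FALSE)

# Drop every character that follows a character other than a space or hyphen
df1$Shortened_Surname <- gsub("(?<=[^ -]).", "", df1$Surname, perl = TRUE)

df1$Shortened_Surname  # "TB" "ZJ"
```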

You can try this if the format of the names is as shown in the input data:
library(stringr)
df1$Shortened_Surname <- sapply(str_extract_all(df1$Surname, '[A-Z]'), function(x) paste(x, collapse = ''))
Output is as follows:
Surname Name Shortened_Surname
1 The Builder Bob TB
2 Zeta-Jones Catherine ZJ
If the format of the names is less consistent (for example, lowercase particles), you will need to modify the pattern accordingly. You can use the | (alternation) operator inside the pattern to combine several alternatives; note that regular expressions have no & operator.
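If capitalization cannot be relied on, one option is to continue from the strsplit step in the question and take the first character of every piece. This is a base-R sketch under that assumption; the third surname is a made-up example of a lowercase particle.

```r
# "van der Berg" is a hypothetical extra case; [A-Z] alone would give just "B"
surnames <- c("The Builder", "Zeta-Jones", "van der Berg")

# Split on runs of spaces or hyphens, then paste the first letters together
parts <- strsplit(surnames, "[ -]+")
initials <- vapply(parts,
                   function(p) paste(substr(p, 1, 1), collapse = ""),
                   character(1))

initials  # "TB" "ZJ" "vdB"
```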

Extracting capturing groups from a regex

This regex: (.*?)(?:I[0-9]-)*I3(?:-I[0-9])* matches an expression using multiple groups. The point of the regex is that it captures patterns in pairs of two, where the first part of the regex has to be followed by the second part of the regex.
How can I extract each of these two groups?
library(stringr)
data <- c("A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3-I7")
str_extract_all(data, "(.*?)(?:I[0-9]-)*I3(?:-I[0-9])*")
Gives me:
[[1]]
[1] "A-B-C-I1-I2-D-E-F-I1-I3" "-D-D-D-D-I1-I1-I2-I1-I1-I3-I3-I7"
However, I would want something along the lines of:
[[1]]
[1] "A-B-C-I1-I2-D-E-F" [2] "I1-I3"
[[2]]
[1] "D-D-D-D" [2] "I1-I1-I2-I1-I1-I3-I3-I7"
The key here is that the regex matches twice, each time containing 2 groups. I want every match to have a list of its own, and that list to contain 2 elements, one for each group.

You need to wrap a capturing group around the second part of your expression and if you're using stringr for this task, I would use str_match_all instead to return the captured matches ...
library(stringr)
data <- c('A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3-I7')
mat <- str_match_all(data, '-?(.*?)-((?:I[0-9]-)*I3(?:-I[0-9])*)')[[1]][,2:3]
colnames(mat) <- c('Group 1', 'Group 2')
# Group 1 Group 2
# [1,] "A-B-C-I1-I2-D-E-F" "I1-I3"
# [2,] "D-D-D-D" "I1-I1-I2-I1-I1-I3-I3-I7"
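If you want the list-of-pairs shape from the question without stringr, one possible base-R route is to find the whole matches with gregexpr and then re-match each of them with regexec to pull out the groups. This is only a sketch: it assumes an R version where regexec accepts perl = TRUE, and the names whole and pairs are made up.

```r
data <- "A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3-I7"
pat <- "-?(.*?)-((?:I[0-9]-)*I3(?:-I[0-9])*)"

# Step 1: all whole matches in the string
whole <- regmatches(data, gregexpr(pat, data, perl = TRUE))[[1]]

# Step 2: re-match each whole match and keep only the capture groups
pairs <- lapply(regmatches(whole, regexec(pat, whole, perl = TRUE)),
                function(m) m[-1])

pairs[[1]]  # "A-B-C-I1-I2-D-E-F" "I1-I3"
pairs[[2]]  # "D-D-D-D" "I1-I1-I2-I1-I1-I3-I3-I7"
```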

How to allow for arbitrary number of wildcards in regexes?

I have a list of character strings:
> head(g_patterns_clean_strings)
[[1]]
[1] "1FAFA"
[[2]]
[1] "FA,TRFA"
[[3]]
[1] "FAEX"
I am trying to identify specific patterns in these character strings, as such:
library(devtools)
g_patterns_clean <- source_gist("164f798524fd6904236a")[[1]]
g_patterns_clean_strings <- source_gist("af70a76691aacf05c1bb")[[1]]
FA_EX_logic_vector <- grepl(g_patterns_clean_strings, pattern = "(FAEX|EXFA)+")
FA_EX_cluster <- subset(g_patterns_clean, FA_EX_logic_vector)
Let's now say that I want to allow for an arbitrary number of other characters in between FA and EX (or EX and FA), how can I specify that in the regex above?

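This is a flexible generalization of @eipi10's answer:
(FA.{0,2}EX|EX.{0,2}FA)
The . matches any character, and the {0,2} quantifier matches between 0 and 2 occurrences of it. Raise the upper bound to allow a longer gap, or use .* to allow an arbitrary number of characters in between.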
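To see how a bounded gap differs from an unbounded one, here is a quick sketch with grepl. The first three strings come from the question; the last two are made up to exercise longer gaps.

```r
x <- c("1FAFA", "FA,TRFA", "FAEX", "FA,TREX", "EX12345FA")

# At most two arbitrary characters between FA and EX (in either order)
bounded <- grepl("(FA.{0,2}EX|EX.{0,2}FA)", x)

# Any number of arbitrary characters in between
unbounded <- grepl("(FA.*EX|EX.*FA)", x)

bounded    # FALSE FALSE  TRUE FALSE FALSE
unbounded  # FALSE FALSE  TRUE  TRUE  TRUE
```

Note that .* is greedy and will happily span the whole string, so the unbounded version only tells you that both tokens occur in that order somewhere.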

How to extract a part from a string in R

I am having trouble extracting the numeric parts of a string in R. The original string is, for example, "buy 1000 shares of Google at 1100 GBP".
I need to extract the number of shares (1000) and the price (1100) separately. I also need to extract the name of the stock, which always appears after "shares of".
I know that sub and gsub can replace parts of a string, but which commands should I use to extract part of a string?

1) This extracts all numbers in order:
s <- "buy 1000 shares of Google at 1100 GBP"
library(gsubfn)
strapplyc(s, "[0-9.]+", simplify = as.numeric)
giving:
[1] 1000 1100
2) If the numbers can be in any order but if the number of shares is always followed by the word "shares" and the price is always followed by GBP then:
strapplyc(s, "(\\d+) shares", simplify = as.numeric) # 1000
strapplyc(s, "([0-9.]+) GBP", simplify = as.numeric) # 1100
The portion of the string matched by the part of the regular expression within parens is returned.
3) If the string is known to be of the form: X shares of Y at Z GBP then X, Y and Z can be extracted like this:
strapplyc(s, "(\\d+) shares of (.+) at ([0-9.]+) GBP", simplify = c)
Added: the pattern was modified to allow either digits or a dot. (3) above was also added, along with the following examples on a vector of strings:
strapply(c(s, s), "[0-9.]+", as.numeric)
strapply(c(s, s), "[0-9.]+", as.numeric, simplify = rbind) # if each string has the same number of matches
strapply(c(s, s), "(\\d+) shares", as.numeric, simplify = c)
strapply(c(s, s), "([0-9.]+) GBP", as.numeric, simplify = c)
strapplyc(c(s, s), "(\\d+) shares of (.+) at ([0-9.]+) GBP")
strapplyc(c(s, s), "(\\d+) shares of (.+) at ([0-9.]+) GBP", simplify = rbind)

You can use the sub function:
s <- "buy 1000 shares of Google at 1100 GBP"
# the number of shares
sub(".* (\\d+) shares.*", "\\1", s)
# [1] "1000"
# the stock
sub(".*shares of (\\w+) .*", "\\1", s)
# [1] "Google"
# the price
sub(".* at (\\d+) .*", "\\1", s)
# [1] "1100"
You can also use gregexpr and regmatches to extract all substrings at once:
regmatches(s, gregexpr("\\d+(?= shares)|(?<=shares of )\\w+|(?<= at )\\d+",
s, perl = TRUE))
# [[1]]
# [1] "1000" "Google" "1100"
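If the string always has the fixed form "X shares of Y at Z GBP", all three pieces can also be pulled out in one pass with regexec and capture groups. This is a sketch under that assumption; the list name trade is made up, and the stock name is assumed to be a single word.

```r
s <- "buy 1000 shares of Google at 1100 GBP"

# m[1] is the whole match; m[2], m[3], m[4] are the three capture groups
m <- regmatches(s, regexec("(\\d+) shares of (\\w+) at (\\d+) GBP", s))[[1]]

trade <- list(shares = as.numeric(m[2]),
              stock  = m[3],
              price  = as.numeric(m[4]))

trade$shares  # 1000
trade$stock   # "Google"
trade$price   # 1100
```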

I feel compelled to include the obligatory stringr solution as well.
library(stringr)
s <- "buy 1000 shares of Google at 1100 GBP"
str_match(s, "([0-9]+) shares")[2]
[1] "1000"
str_match(s, "([0-9]+) GBP")[2]
[1] "1100"

If you want to extract all digits from text, use this function from the stringi package.
"Nd" is the class of decimal digits.
library(stringi)
stri_extract_all_charclass(c(123,43,"66ala123","kot"),"\\p{Nd}")
[[1]]
[1] "123"
[[2]]
[1] "43"
[[3]]
[1] "66" "123"
[[4]]
[1] NA
Please note that here the numbers 66 and 123 are extracted separately.
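For comparison, here is a rough base-R counterpart using gregexpr and regmatches. One difference to be aware of: an element without digits yields character(0) rather than NA. The variable name digits is made up.

```r
x <- c(123, 43, "66ala123", "kot")   # coerced to a character vector

# One list element per input string, holding every run of ASCII digits
digits <- regmatches(x, gregexpr("[0-9]+", x))

digits[[3]]      # "66" "123"
lengths(digits)  # 1 1 2 0
```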
