R - Extracting number from string with regular expression

R - Extracting number from string with regular expression - regex

I want to extract a number with decimals from a string with one single expression if possible.
For example transform "2,123.02" to "2123.02" - my current solution is:
paste(unlist(str_extract_all("2,123.02","\\(?[0-9.]+\\)?",simplify=F)),collapse="")
But what I'm looking for is the expression in str_extract_all to just bind it together as a vector by themself. Is this possible to achieve with an regular expression?

You can try replacing the comma by an empty string:
gsub(",", "", "2,123.02")
#[1] "2123.02"
NB: If you need to replace only commas in between numbers, you can use lookarounds:
gsub("(?<=[0-9]),(?=[0-9])", "", "this, this is my number 2,123.02", perl=TRUE)
#[1] "this, this is my number 2123.02"
I edited with sub instead of gsub in case you have strings with more than one number with a comma. In case you only have one, sub is "sufficient".
NB2: You can call str_extrac_all on the result from gsub, e.g.:
str_extract_all(gsub("(?<=[0-9]),(?=[0-9])", "","first number: 2,123.02, second number: 3,456", perl=T), "\\d+\\.*\\d*", simplify=F)
#[[1]]
#[1] "2123.02" "3456"

Another option is extract_numeric in the tidyr package.
library(tidyr)
extract_numeric("2,123.02")
[1] 2123.02

Related

Return only matching portion of regular expression

I have:
> pattern
[1] "(/[[:digit:]]{4}/)"
so I want to extract only the matching portions...the digits plus the /.../. Here's what I tried:
> gsub(pattern, '\\1', grep(pattern, c("t3tg3wgw/5764/", "ggg", "grsgs/gwgew/5656/bfsbs"), value=TRUE))
[1] "t3tg3wgw/5764/" "grsgs/gwgew/5656/bfsbs"
However this still returns letters attached to the actual match that do not themselves match the regex. How can I extract only /5764/ and /5656/?

We could extract the pattern / followed by one or more numbers ([0-9]+) followed by / using str_extract_all from library(stringr) to output a list, which can be unlisted to convert to vector
library(stringr)
unlist(str_extract_all(v1, '/[0-9]+/'))
#[1] "/5764/" "/5656/"
Or we use the same pattern and using regmatches/gregexpr from base R
unlist(regmatches(v1, gregexpr('/[0-9]+/',v1)))
#[1] "/5764/" "/5656/"
data
v1 <- c("t3tg3wgw/5764/", "ggg", "grsgs/gwgew/5656/bfsbs")

Try changing the pattern to .*(/[[:digit:]]{4}/).*

Retrieve digits after specific string in R

I have a bunch of strings that contain the word "radius" followed by one or two digits. They also contain a lot of other letters, digits, and underscores. For example, one is "inflow100_radius6_distance12". I want a regex that will just return the one or two digits following "radius." If R recognized \K, then I would just use this:
radius\K[0-9]{1,2}
and be done. But R doesn't allow \K, so I ended up with this instead (which selects radius and the following numbers, and then cuts off "radius"):
result <- regmatches(input_string, gregexpr("radius[0-9]{1,2}", input_string))
result <- unlist(substr(result, 7, 8)))
I'm pretty new to regex, so I'm sure there's a better way. Any ideas?

\K is recognized. You can solve the problem by turning on the perl = TRUE parameter.
result <- regmatches(x, gregexpr('radius\\K\\d+', x, perl=T))

1) Match the entire string replacing it with the digits after radius:
sub(".*radius(\\d+).*", "\\1", "inflow100_radius6_distance12")
## [1] "6"
The regular expression can be visualized as follows:
.*radius(\d+).*
Debuggex Demo
2) This also works, involves a simpler regular expression and converts it to numeric at the same time:
library(gsubfn)
strapply("inflow100_radius6_distance12", "radius(\\d+)", as.numeric, simplify = TRUE)
## [1] 6
Here is a visualization of the regular expression:
radius(\d+)
Debuggex Demo

Extract subset of a string following specific text in R

I am trying to extract all of the words in the string below contained within the brackets following the word 'tokens' only if the 'tokens' occurs after 'tag(noun)'.
For example, I have the string:
m<- "phrase('The New York State Department',[det([lexmatch(['THE']),
inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),
inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),
head([lexmatch([department]),inputmatch(['Department']),tag(noun),
tokens([department])])],0/29,[])."
I want to get a list of all of the words that occur within the brackets after the word 'tokens' only when the word tokens occurs after 'tag(noun)'.
Therefore, I want my output to be a vector of the following:
[1] new, york, state, department
How do I do this? I'm assuming I have to use a regular expression, but I'm lost on how to write this in R.
Thanks!

Remove newlines and then extract the portion matched to the part between parentheses in pattern pat. Then split apart such strings by commas and simplify into a character vector:
library(gsubfn)
pat <- "tag.noun.,tokens..(.*?)\\]"
strapply(gsub("\\n", "", m), pat, ~ unlist(strsplit(x, ",")), simplify = c)
giving:
[1] "new" "york" "state" "department"
Visualization: Here is the debuggex representation of the regular expression in pat. (Note that we need to double the backslash when put within R's double quotes):
tag.noun.,tokens..(.*?)\]
Debuggex Demo
Note that .*? means match the shortetst string of any characters such that the entire pattern matches - without the ? it would try to match the longest string.

How about something like this. Here i'll use the regcatputedmatches helper function to make it easier to extract the captured matches.
m<- "phrase('The New York State Department',[det([lexmatch(['THE']),inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),head([lexmatch([department]),inputmatch(['Department']),tag(noun),tokens([department])])],0/29,[])."
rx <- gregexpr("tag\\(noun\\),tokens\\(\\[([^]]+)\\]\\)", m, perl=T)
lapply(regcapturedmatches(m,rx), function(x) {
unlist(strsplit(c(x),","))
})
# [[1]]
# [1] "new" "york" "state" "department"
The regular expression is a bit messy because your desired match contains many special regular expression symbols so we need to properly escape them.

Here is a one liner if you like:
paste(unlist(regmatches(m, gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl=T))), collapse=",")
[1] "new,york,state,department"
Broken down:
# Get match indices
indices <- gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl=T)
# Extract the matches
matches <- regmatches(m, indices)
# unlist and paste together
paste(unlist(matches), collapse=",")
[1] "new,york,state,department"

Extract capture group matches from regular expressions? (or: where is gregexec?)

Given a regular expression containing capture groups (parentheses) and a string, how can I obtain all the substrings matching the capture groups, i.e., the substrings usually referenced by "\1", "\2"?
Example: consider a regex capturing digits preceded by "xy":
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"
Desired result:
[1] "1234" "567"
First attempt: gregexpr:
regmatches(s,gregexpr(r,s))
#[[1]]
#[1] "xy1234" "xy567"
Not what I want because it returns the substrings matching the entire pattern.
Second try: regexec:
regmatches(s,regexec("xy(\\d+)",s))
#[[1]]
#[1] "xy1234" "1234"
Not what I want because it returns only the first occurence of a matching for the entire pattern and the capture group.
If there was a gregexec function, extending regexec as gregexpr extends regexpr, my problem would be solved.
So the question is: how to retrieve all substrings (or indices that can be passed to regmatches as in the examples above) matching capture groups in an arbitrary regular expression?
Note: the pattern for r given above is just a silly example, it must remain arbitrary.

For a base R solution, what about just using gsub() to finish processing the strings extracted by gregexpr() and regmatches()?
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"
gsub(r, "\\1", regmatches(s,gregexpr(r,s))[[1]])
# [1] "1234" "567"

Not sure about doing this in base, but here's a package for your needs:
library(stringr)
str_match_all(s, r)
#[[1]]
# [,1] [,2]
#[1,] "xy1234" "1234"
#[2,] "xy567" "567"
Many stringr functions also have parallels in base R, so you can also achieve this without using stringr.
For instance, here's a simplified version of how the above works, using base R:
sapply(regmatches(s,gregexpr(r,s))[[1]], function(m) regmatches(m,regexec(r,m)))

strapplyc in the gsubfn package does that:
> library(gsubfn)
>
> strapplyc(s, r)
[[1]]
[1] "1234" "567"
Try ?strapplyc for additional info and examples.
Related Functions
1) A generalization of strapplyc is strapply in the same package. It takes a function which inputs the captured portions of each match and returns the output of the function. When the function is c it reduces to strapplyc. For example, suppose we wish to return results as numeric:
> strapply(s, r, as.numeric)
[[1]]
[1] 1234 567
2) gsubfn is another related function in the same package. It is like gsub except the replacement string can be a replacement function (or a replacement list or a replacement proto object). The replacement function inputs the captured portions and outputs the replacement. The replacement replaces the match in the input string. If a formula is used, as in this example, the right hand side of the formula is regarded as the function body. In this example we replace the match with XY{#} where # is twice the matched input number.
> gsubfn(r, ~ paste0("XY{", 2 * as.numeric(x), "}"), s)
[1] "XY{2468}wz98XY{1134}"
UPDATE: Added strapply and gsubfn examples.

Since R 4.1.0, there is gregexec:
regmatches(s,gregexec(r,s))[[1]][2, ]
[1] "1234" "567"

R: Extract data from string using POSIX regular expression

How to extract only DATABASE_NAME from this string using POSIX-style regular expressions?
st <- "MICROSOFT_SQL_SERVER.DATABASE\INSTANCE.DATABASE_NAME."
First of all, this generates an error
Error: '\I' is an unrecognized escape in character string starting "MICROSOFT_SQL_SERVER.DATABASE\I"
I was thinking something like
sub(".*\\.", st, "")

The first problem is that you need to escape the \ in your string:
st <- "MICROSOFT_SQL_SERVER.DATABASE\\INSTANCE.DATABASE_NAME."
As for the main problem, this will return the bit you want from the string you gave:
> sub("\\.$", "", sub("[A-Za-z0-9\\._]*\\\\[A-Za-z]*\\.", "", st))
[1] "DATABASE_NAME"
But a simpler solution would be to split on the \\. and select the last chunk:
> strsplit(st, "\\.")[[1]][3]
[1] "DATABASE_NAME"
or slightly more automated
> sst <- strsplit(st, "\\.")[[1]]
> tail(sst, 1)
[1] "DATABASE_NAME"

Other answers provided some really good alternative ways of cracking the problem using strsplit or str_split.
However, if you really want to use a regex and gsub, this solution substitutes the first two occurrences of a (string followed by a period) with an empty string.
Note the use of the ? modifier to tell the regex not to be greedy, as well as the {2} modifier to tell it to repeat the expression in brackets two times.
gsub("\\.", "", gsub("(.+?\\.){2}", "", st))
[1] "DATABASE_NAME"

An alternative approach is to use str_split in package stringr. The idea is to split st into strings at each period, and then to isolate the third string:
st <- "MICROSOFT_SQL_SERVER.DATABASE\\INSTANCE.DATABASE_NAME."
library(stringr)
str_split(st, "\\.")[[1]][3]
[1] "DATABASE_NAME"

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

R - Extracting number from string with regular expression - regex

Another option is extract_numeric in the tidyr package. library(tidyr) extract_numeric("2,123.02") [1] 2123.02

Related

Return only matching portion of regular expression

Retrieve digits after specific string in R

Extract subset of a string following specific text in R

Extract capture group matches from regular expressions? (or: where is gregexec?)

R: Extract data from string using POSIX regular expression

Categories

Resources