Get the numeric characters from alphanumeric string in R? - regex

Possible duplicate: 1 2
I read the above discussions.
I want to extract all the numeric characters from an alphanumeric string using R.
My Code:
> y <- c()
> x <- c("wXYz04516", "XYz24060", "AB04512", "wCz04110", "wXYz04514", "wXYz04110")
> for (i in 1:length(x)){
+ y <- c(as.numeric(gsub("[a-zA-Z]", "", x[i])),y)
+ }
> print (y)
[1] 4110 4514 4110 4512 24060 4516
This outputs all the numeric characters, but it fails to keep the leading zero ("0").
The output omits the leading zero in the case of 4110, 4514, 4110, 4512, and 4516.
How can I keep the leading zero in front of these numbers?

Numeric values cannot carry leading zeros, so to keep them you'll have to leave the values as character. You can, however, print them without quotes if you want.
x <- c("wXYz04516", "XYz24060", "AB04512", "wCz04110", "wXYz04514")
gsub("\\D+", "", x)
# [1] "04516" "24060" "04512" "04110" "04514"
as.numeric(gsub("\\D+", "", x))
# [1] 4516 24060 4512 4110 4514
print(gsub("\\D+", "", x), quote = FALSE)
# [1] 04516 24060 04512 04110 04514
So the last one looks like a numeric, but is actually a character.
Side note: gsub() and as.numeric() are both vectorized functions, so there's also no need for a for() loop in this operation.

If you want the leading zeros, you will need to create a character vector instead of a numeric one, so change as.numeric to as.character.
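As a minimal sketch of the two answers above: the digits stay intact as long as the result is character, and a numeric copy can be re-padded for display with sprintf(). The fixed width of 5 is an assumption taken from the sample data, not something the question guarantees.

```r
x <- c("wXYz04516", "XYz24060", "AB04512", "wCz04110", "wXYz04514", "wXYz04110")

# Vectorized extraction; the result is character, so leading zeros survive
digits <- gsub("\\D+", "", x)

# Numeric copy for arithmetic (the leading zeros are lost here)
n <- as.numeric(digits)

# Re-pad to 5 digits for display (width 5 is assumed from the sample data)
padded <- sprintf("%05d", as.integer(n))
```

Here padded is again a character vector; the padding only affects how the values are shown.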

Related

Extracting and merging numbers from strings

I have strings with numbers as follow:
972 2 6424979
81|5264627
49-0202-2801986
07.81.48.27.89
0664/3420034
06041 - 8728
and would like to get an output like:
97226424979
815264627
4902022801986
0781482789
06643420034
060418728
I tried using:
as.numeric(gsub("([0-9]+).*$", "\\1", numbers))
but the numbers are separate in the output.
To get your exact output,
#to avoid scientific notation
options(scipen=999)
#find which have leading 0
ind <- which(substring(numbers, 1, 1) == "0")
y <- as.numeric(gsub("\\D", "", numbers))
y[ind] <- paste0('0', y[ind])
y
#[1] "97226424979" "815264627" "4902022801986" "0781482789" "06643420034" "060418728"
([0-9]+).*$ puts a number sequence until the first non-number into \\1. However, you want:
numbers <- readLines(n=6)
972 2 6424979
81|5264627
49-0202-2801986
07.81.48.27.89
0664/3420034
06041 - 8728
as.numeric(gsub("\\D", "", numbers))
This replaces all non-numbers by nothing.
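If the values are identifiers (phone numbers, say) rather than quantities, a simpler sketch is to skip as.numeric() altogether, which keeps the leading zeros and avoids scientific notation. The vector below just restates the question's sample data.

```r
numbers <- c("972 2 6424979", "81|5264627", "49-0202-2801986",
             "07.81.48.27.89", "0664/3420034", "06041 - 8728")

# Drop every non-digit character; the result stays character
cleaned <- gsub("\\D", "", numbers)
cleaned
# [1] "97226424979"   "815264627"     "4902022801986" "0781482789"
# [5] "06643420034"   "060418728"
```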

How to modify string in R taking into account the number of symbols you want to modify [duplicate]

This question is very easy to understand, but I can't wrap my head around how to get a solution. Let's say I have a vector and I want to modify it so that each element ends in 5 digits, with the missing digits filled in with zeros:
Smth1 Smth00001
Smth22 Smth00022
Smth333 Smth00333
Smth4444 Smth04444
Smth55555 Smth55555
I guess it can be done with regex and functions like gsub, but I don't understand how to take into account the length of the replaced string.
Here's an idea using stringi:
v <- c("Smth1", "Smth22", "Smth333", "Smth4444", "Smth55555")
library(stringi)
d <- stri_extract(v, regex = "[:digit:]+")
a <- stri_extract(v, regex = "[:alpha:]+")
paste0(a, stri_pad_left(d, 5, "0"))
Which gives:
[1] "Smth00001" "Smth00022" "Smth00333" "Smth04444" "Smth55555"
Using base R. Someone else can prettify the regex:
x <- c("Smth1", "Smth22", "Smth333", "Smth4444", "Smth55555")  # the input vector
sprintf("%s%05d", gsub("^([^0-9]+)..*$", "\\1", x),
        as.numeric(gsub("^..*[^0-9]([0-9]+)$", "\\1", x)))
# [1] "Smth00001" "Smth00022" "Smth00333" "Smth04444" "Smth55555"
Here is a simple one-line solution similar to Zelazny's, but using a replacement callback with gsubfn from the gsubfn library:
> library(gsubfn)
> v <- c("Smth1", "Smth22", "Smth333", "Smth4444", "Smth55555")
> gsubfn('[0-9]+$', ~ sprintf("%05d",as.numeric(x)), v)
[1] "Smth00001" "Smth00022" "Smth00333" "Smth04444" "Smth55555"
The regex [0-9]+$ matches 1 or more digits at the end of the string only, due to the $ anchor. The matched digits are passed to the callback (~), and sprintf("%05d",as.numeric(x)) pads the number (parsed as a numeric with as.numeric) with zeros.
To only modify strings that have 1+ non-digit symbols at the start and then 1 or more digits up to the end, just use this PCRE-based gsubfn:
> gsubfn('^[^0-9]+\\K([0-9]+)$', ~ sprintf("%05d",as.numeric(x)), v, perl=TRUE)
[1] "Smth00001" "Smth00022" "Smth00333" "Smth04444" "Smth55555"
where
^ - start of string
[^0-9]+\\K - matches 1+ non-digit symbols, and \K then drops them from the reported match
([0-9]+) - Group 1 passed to the callback
$ - end of string.
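Here is a solution using the stringr library:
library(stringr)
library(dplyr)  # only needed for %>%
num <- str_extract(v, "[0-9]+")  # [0-9] rather than [1-9], so a 0 inside the number is kept
padding <- 9 - nchar(num)  # 9 = 4-letter prefix + 5 digits (assumes the prefix is always 4 characters)
output <- paste0(str_extract(v, "[^0-9]+") %>%
  str_pad(width = padding, side = c("right"), pad = "0"), num)
The output is:
"Smth00001" "Smth00022" "Smth00333" "Smth04444" "Smth55555"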
A shorter stringr version:
library(stringr)
paste0(str_extract(v,'\\D+'),str_pad(str_extract(v,'\\d+'),5,'left', '0'))
#[1] "Smth00001" "Smth00022" "Smth00333" "Smth04444" "Smth55555"
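For comparison, here is a base R sketch that never converts the digits to numeric, so it also works for digit runs too long for an integer. It assumes every element ends in at least one digit (regmatches() silently drops elements without a match), and the target width of 5 comes from the question.

```r
v <- c("Smth1", "Smth22", "Smth333", "Smth4444", "Smth55555")

# Trailing run of digits and everything before it
digits <- regmatches(v, regexpr("[0-9]+$", v))
prefix <- sub("[0-9]+$", "", v)

# Number of zeros needed to reach 5 digits (never negative)
zeros <- strrep("0", pmax(0, 5 - nchar(digits)))

padded <- paste0(prefix, zeros, digits)
```

strrep() is vectorized over the repeat count, which is what takes the length of each replaced string into account.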

Split string on [:punct:] except for underscore in R

I have an equation as a string where the variables in the string equation are variables in the R workspace. I would like to replace each variable with its numeric value in the R workspace. This is easy enough when the variable names don't contain punctuation.
Here is a simple example.
x <- 5
y <- 10
yy <- 15
z <- x*(y + yy)
zAsChar <- "z=x*(y+yy)"
vars <- unlist(strsplit(zAsChar, "[[:punct:]]"))
notVars <- unlist(strsplit(zAsChar, "[^[:punct:]]"))
varsValues <- sapply(vars[vars != ""], FUN=function(aaa) get(aaa))
notVarsValues <- notVars[notVars != ""]
paste(paste0(varsValues, notVarsValues), collapse="")
This yields "125=5*(10+15)", which is great.
However, I would love the option to use underscores in the variable names so that I can use "subscripts" for variable names. I am using these strings in math mode in R markdown.
So I need a [:punct:] that excludes _. I tried using [\\+\\-\\*\\/\\(\\)\\=] rather than [:punct:], but with this approach I couldn't split on the minus sign. Is there a way to preserve the _?
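Instead of [:punct:] you can use the Unicode character class \pP (shortcut for \p{P}) and its negation \PP:
[^\\PP_]
(It works with the perl=TRUE option.) Note that \p{P} covers only Unicode punctuation: characters such as + and = belong to the symbol categories (\p{S}), so this class does not match them.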
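For the ASCII operators in the question, one alternative sketch keeps the POSIX class and excludes the underscore with a negative lookahead, which needs perl=TRUE. The variable names z_1 and x_2 below are made up for illustration.

```r
eq <- "z_1=x_2*(y+yy)"

# Any [:punct:] character that is not an underscore
pieces <- strsplit(eq, "(?!_)[[:punct:]]", perl = TRUE)[[1]]

# Drop the empty string produced between "*" and "("
vars <- pieces[pieces != ""]
```

The lookahead is zero-width, but each match still consumes one punctuation character, so the split behaves like an ordinary single-character delimiter.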
Are you sure you need to do all this string manipulation? The substitute() function can help you out
substitute(z==x*(y+yy), list(x=x, y=y, yy=yy,z=z))
Or if you really need to start with a character value
do.call("substitute", list(parse(text=zAsChar)[[1]],list(x=x, y=y, yy=yy,z=z)))
# 125 = 5 * (10 + 15)
You can deparse() the result to turn it back into a character.
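A small sketch of that last step, reusing the question's variables. Removing the spaces afterwards is an extra step added here only to reproduce the compact string from the question.

```r
x <- 5
y <- 10
yy <- 15
z <- x * (y + yy)
zAsChar <- "z=x*(y+yy)"

# Substitute the values into the parsed expression
res <- do.call("substitute",
               list(parse(text = zAsChar)[[1]], list(x = x, y = y, yy = yy, z = z)))

# Back to a character string; deparse() inserts spaces around operators
asChar <- deparse(res)
compact <- gsub(" ", "", asChar, fixed = TRUE)
```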

How to fill gap between two characters with regex

I have a data set like below. I would like to replace all dots between two 1's with 1's, as shown in the desired.result. Can I do this with regex in base R?
I tried:
regexpr("^1\\.1$", my.data$my.string, perl = TRUE)
Here is a solution in C#: "Characters between two exact characters".
Thank you for any suggestions.
my.data <- read.table(text='
my.string state
................1...............1. A
......1..........................1 A
.............1.....2.............. B
......1.................1...2..... B
....1....2........................ B
1...2............................. C
..........1....................1.. C
.1............................1... C
.................1...........1.... C
........1....2.................... C
......1........................1.. C
....1....1...2.................... D
......1....................1...... D
.................1...2............ D
', header = TRUE, na.strings = 'NA', stringsAsFactors = FALSE)
desired.result <- read.table(text='
my.string state
................11111111111111111. A
......1111111111111111111111111111 A
.............1.....2.............. B
......1111111111111111111...2..... B
....1....2........................ B
1...2............................. C
..........1111111111111111111111.. C
.111111111111111111111111111111... C
.................1111111111111.... C
........1....2.................... C
......11111111111111111111111111.. C
....111111...2.................... D
......1111111111111111111111...... D
.................1...2............ D
', header = TRUE, na.strings = 'NA', stringsAsFactors = FALSE)
Below is an option using gsub with the \G feature and lookaround assertions.
> gsub('(?:1|\\G(?<!^))\\K\\.(?=\\.*1)', '1', my.data$my.string, perl = TRUE)
# [1] "................11111111111111111." "......1111111111111111111111111111"
# [3] ".............1.....2.............." "......1111111111111111111...2....."
# [5] "....1....2........................" "1...2............................."
# [7] "..........1111111111111111111111.." ".111111111111111111111111111111..."
# [9] ".................1111111111111...." "........1....2...................."
# [11] "......11111111111111111111111111.." "....111111...2...................."
# [13] "......1111111111111111111111......" ".................1...2............"
The \G feature is an anchor that can match at one of two positions: the start of the string, or the position at the end of the previous match. Since you want to leave the dots at the start of the string alone, we add a negative lookbehind, \G(?<!^), to exclude the start of the string.
The \K escape sequence resets the starting point of the reported match and any previously consumed characters are no longer included.
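A toy illustration of \K on its own (the input string is made up): the a must be present for the pattern to match, but only the b after it is replaced.

```r
# "a" is required by the pattern, but \K drops it from the reported match,
# so only "b" is replaced
res <- gsub("a\\Kb", "X", "ab-ab", perl = TRUE)
```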
Using gsubfn, the first argument is a regular expression which matches the 1's and the characters between them. The second argument is a function, expressed in formula notation, which uses gsub to replace each character in the match with 1. The pattern deliberately has no capture group: with one, gsubfn would pass only the captured text to the function but still replace the whole match, which would drop the two bracketing 1's:
library(gsubfn)
transform(my.data, my.string = gsubfn("1.*1", ~ gsub(".", "1", x), my.string))
If there can be multiple pairs of 1's in a string then use "1.*?1" as the regular expression instead.
Here is an option that uses a relatively simple regex and the standard combination of gregexpr(), regmatches(), and regmatches<-() to identify, extract, operate on, and then replace substrings matching that regex.
## Copy the character vector
x <- my.data$my.string
## Find sequences of "."s bracketed on either end by a "1"
m <- gregexpr("(?<=1)\\.+(?=1)", x, perl=TRUE)
## Standard template for operating on and replacing matched substrings
regmatches(x,m) <- lapply(regmatches(x,m), function(X) gsub(".", "1", X))  # lapply always returns the list that regmatches<- expects
## Check that it worked
head(x)
# [1] "................11111111111111111." "......1111111111111111111111111111"
# [3] ".............1.....2.............." "......1111111111111111111...2....."
# [5] "....1....2........................" "1...2............................."

strsplit inconsistent with gregexpr

A regex suggested in a comment on my answer to this question should give the desired result with strsplit, but it does not, even though it correctly matches only the first and last commas in a character vector. This can be shown using gregexpr and regmatches.
So why does strsplit split on every comma in this example, even though regmatches returns only two matches for the same regex?
# We would like to split on the first comma and
# the last comma (positions 4 and 13 in this string)
x <- "123,34,56,78,90"
# Splits on every comma. Must be wrong.
strsplit( x , '^\\w+\\K,|,(?=\\w+$)' , perl = TRUE )[[1]]
#[1] "123" "34" "56" "78" "90"
# Ok. Let's check the positions of matches for this regex
m <- gregexpr( '^\\w+\\K,|,(?=\\w+$)' , x , perl = TRUE )
# Matching positions are at
unlist(m)
[1] 4 13
# And extracting them...
regmatches( x , m )
[[1]]
[1] "," ","
Huh?! What is going on?
The theory of @Aprillion is correct. From the R documentation:
The algorithm applied to each input string is
repeat {
if the string is empty
break.
if there is a match
add the string to the left of the match to the output.
remove the match and all to the left of it.
else
add the string to the output.
break.
}
In other words, at each iteration ^ will match the beginning of the remaining string (without the parts that were already removed).
To simply illustrate this behavior:
> x <- "12345"
> strsplit( x , "^." , perl = TRUE )
[[1]]
[1] "" "" "" "" ""
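The documented loop can be emulated roughly as follows. This is only a sketch of the quoted algorithm, not R's actual implementation: split_like_strsplit is a hypothetical helper, and it does not handle zero-length matches (it would loop forever on them).

```r
# Hypothetical helper that follows the documented loop step by step
split_like_strsplit <- function(s, pattern) {
  out <- character(0)
  while (nzchar(s)) {
    m <- regexpr(pattern, s, perl = TRUE)
    if (m[1] == -1) {
      out <- c(out, s)                   # no match: keep the rest and stop
      break
    }
    out <- c(out, substr(s, 1, m[1] - 1))  # text to the left of the match
    # remove the match and everything to the left of it
    s <- substr(s, m[1] + attr(m, "match.length"), nchar(s))
  }
  out
}

pat <- "^\\w+\\K,|,(?=\\w+$)"
res_commas <- split_like_strsplit("123,34,56,78,90", pat)
res_anchor <- split_like_strsplit("12345", "^.")
```

Because the pattern is re-applied to the shortened string each time, ^ matches again after every removal, which reproduces the split on every comma.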
Here, you can see the consequence of this behavior with a lookahead assertion as delimiter (thanks to @JoshO'Brien for the link).
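One way to get the intended split, sketched here with base R only, is to let gregexpr() find all matches in the original string at once and then take the pieces between them with regmatches(..., invert = TRUE):

```r
x <- "123,34,56,78,90"

# Match positions are computed once, against the whole string
m <- gregexpr("^\\w+\\K,|,(?=\\w+$)", x, perl = TRUE)

# invert = TRUE returns the substrings between the matches
parts <- regmatches(x, m, invert = TRUE)[[1]]
```

Since the string is never shortened between matches, the ^ anchor can only match once.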