Matching the first digits in R - regex

Anyone know a good way to match and categorize the first n digits of a number in R?
For example,
123451
123452
123461
123462
In this case, the if we match on the first n=1-4 digits, we would get all the same group. If we match with n=5 digits, we would get 2 groups.
I thought about doing this by making the numeric vector a character vector, splitting it so that each number is an element that can then be truncated to n digits, and matching based on those digits; however, I have a lot of numbers, and it seems there must be a better way to sort only the first n digits of a number in R. Any thoughts?
Thanks!

Here's a vectorised solution that does not involve conversion to character:
nums <- c(123451,
123452,
123461,
123462)
firstDigits <- function(x, n) {
ndigits <- floor(log10(x)) + 1
floor(x / 10^(ndigits - n))
}
factor(firstDigits(nums, 4))
## [1] 1234 1234 1234 1234
## Levels: 1234
factor(firstDigits(nums, 5))
## [1] 12345 12345 12346 12346
## Levels: 12345 12346
factor(firstDigits(nums, 6))
## [1] 123451 123452 123461 123462
## Levels: 123451 123452 123461 123462

Related

Extracting and merging numbers from strings

I have strings with numbers as follow:
972 2 6424979
81|5264627
49-0202-2801986
07.81.48.27.89
0664/3420034
06041 - 8728
and would like to get an output like:
97226424979
815264627
4902022801986
0781482789
06643420034
060418728
I tried using:
as.numeric(gsub("([0-9]+).*$", "\\1", numbers))
but the numbers are separate in the output.
To get your exact output,
#to avoid scientific notation
options(scipen=999)
#find which have leading 0
ind <- which(substring(x, 1, 1) == 0)
y <- as.numeric(gsub("\\D", "", numbers))
y[ind] <- paste0('0', y[ind])
y
#[1] "97226424979" "815264627" "4902022801986" "0781482789" "06643420034" "060418728"
([0-9]+).*$ puts a number sequence until the first non-number into \\1. However, you want:
numbers <- readLines(n=6)
972 2 6424979
81|5264627
49-0202-2801986
07.81.48.27.89
0664/3420034
06041 - 8728
as.numeric(gsub("\\D", "", numbers))
This replaces all non-numbers by nothing.

Get the numeric characters from alphanumeric string in R?

Possible duplicate: 1 2
I read the above discussions.
I want to get all numerical characters from alphanumerical string using R?
My Code:
> y <- c()
> x <- c("wXYz04516", "XYz24060", "AB04512", "wCz04110", "wXYz04514", "wXYz04110")
> for (i in 1:length(x)){
+ y <- c(as.numeric(gsub("[a-zA-Z]", "", x[i])),y)
+ }
> print (y)
[1] 4110 4514 4110 4512 24060 4516
Here it outputs the all numerical charters, but fail to get starting number zero ("0")
The output omits starting Zero ("0") digit in case of 4110, 4514, 4110, 4512, and 4516.
How can I get digit zero included before the numbers?
Leading zeroes are not allowed on whole numeric values. So to have the leading zeros, you'll have to leave them as character. You can, however, print them without quotes if you want.
x <- c("wXYz04516", "XYz24060", "AB04512", "wCz04110", "wXYz04514")
gsub("\\D+", "", x)
# [1] "04516" "24060" "04512" "04110" "04514"
as.numeric(gsub("\\D+", "", x))
# [1] 4516 24060 4512 4110 4514
print(gsub("\\D+", "", x), quote = FALSE)
# [1] 04516 24060 04512 04110 04514
So the last one looks like a numeric, but is actually a character.
Side note: gsub() and as.numeric() are both vectorized functions, so there's also no need for a for() loop in this operation.
If you want the leading zeroes, you will need to create a character vector instead of numeric one, so change as.numeric to as.character.

R: how to find the first digit in a string

string = "ABC3JFD456"
Suppose I have the above string, and I wish to find what the first digit in the string is and store its value. In this case, I would want to store the value 3 (since it's the first-occuring digit in the string). grepl("\\d", string) only returns a logical value, but does not tell me anything about where or what the first digit is. Which regular expression should I use to find the value of the first digit?
Base R
regmatches(string, regexpr("\\d", string))
## [1] "3"
Or using stringi
library(stringi)
stri_extract_first(string, regex = "\\d")
## [1] "3"
Or using stringr
library(stringr)
str_extract(string, "\\d")
## [1] "3"
1) sub Try sub with the indicated regular expression which takes the shortest string until a digit, a digit and then everything following and replaces it with the digit:
sub(".*?(\\d).*", "\\1", string)
giving:
[1] "3"
This also works if string is a vector of strings.
2) strapplyc It would also be possible to use strapplyc from gsubfn in which case an even simpler regular expression could be used:
strapplyc(string, "\\d", simplify = TRUE)[1]
giving the same or use this which gives the same answer again but also works if string is a vector of strings:
sapply(strapplyc(string, "\\d"), "[[", 1)
Get the locations of the digits
tmp <- gregexpr("[0-9]", string)
iloc <- unlist(tmp)[1]
Extract the first digit
as.numeric(substr(string,iloc,iloc))
Using regexpr is simpler
tmp<-regexpr("[0-9]",string)
if(tmp[[1]]>=0) {
iloc <- tmp[1]
num <- as.numeric(substr(string,iloc,iloc))
}
Using rex may make this type of task a little simpler.
string = c("ABC3JFD456", "ARST4DS324")
re_matches(string,
rex(
capture(name = "first_number", digit)
)
)
#> first_number
#> 1 3
#> 2 4
> which( sapply( strsplit(string, ""), grepl, patt="[[:digit:]]"))[1]
[1] 4
Or
> gregexpr("[[:digit:]]", string)[[1]][1]
[1] 4
So:
> splstr[[1]][ which( sapply( splstr, grepl, patt="[[:digit:]]"))[1] ]
[1] "3"
Note that a full result from a gregexpr call is a list, hence the need to extract its first element with "[[":
> gregexpr("[[:digit:]]", string)
[[1]]
[1] 4 8 9 10
attr(,"match.length")
[1] 1 1 1 1
attr(,"useBytes")
[1] TRUE
A gsub solution that is based on replacing the substrings preceding and following the first digit with the empty string:
gsub("^\\D*(?=\\d)|(?<=\\d).*", "", string, perl = TRUE)
# [1] "3"

Extract with Specific Prefix and control no. of digits in Regex- R

I need to extract the below pattern in R
(10 digits), prefix with 3, 5, 9 (e.g. 3234567890, 5234567890, 9234567890)
(10 digits), prefix with 4 (e.g. 4234567890)
(10 digits), prefix with 8 (e.g. 8234567890)
and the below
TAM(5 digits) – e.g. TAM12345 (numbers starting with TAM and 5 digits)
E(7 digits) – e.g. E1234567 (numbers starting with E and only 7 digits)
A(5 digits) – e.g. A12345 (numbers starting with A and only 5 digits)
I use stingr library.
I am able to extract numbers (with alpha)- not sure how to give specific Prefix and to restrict the digits
The email is below
These are the notice number - with high priority
3234567890 and 5234567890 and the long pending issue 9234567890 along with the discuused numbers 4234567890,8234567890.
Special messages from TAM12345,E1234567 and A12345
Required Output
3234567890, 5234567890, 9234567890
4234567890
8234567890
TAM12345
E1234567
A12345
You could try the below code which uses word boundary \b. Word boundary is used to match between a word character and a non-word character.
> library(stringr)
> str_extract_all(x, perl('\\b(?:[35948]\\d{9}|TAM\\d{5}|E\\d{7}|A\\d{5})\\b'))
[[1]]
[1] "3234567890" "5234567890" "9234567890" "4234567890" "8234567890"
[6] "TAM12345" "E1234567" "A12345"
Using the stringr library:
> library(stringr)
> str_extract_all(x, perl('\\b(?:[3-589]\\d{9}|(?:TAM|A)\\d{5}|E\\d{7})\\b'))
[[1]]
[1] "3234567890" "5234567890" "9234567890" "4234567890" "8234567890"
[6] "TAM12345" "E1234567" "A12345"
Using the gsubfn library:
> library(gsubfn)
> strapply(x, '\\b([3-589]\\d{9}|(?:TAM|A)\\d{5}|E\\d{7})\\b', perl=T)
[[1]]
[1] "3234567890" "5234567890" "9234567890" "4234567890" "8234567890"
[6] "TAM12345" "E1234567" "A12345"
And base R which handles this just as well.
> regmatches(x, gregexpr('\\b(?:[3-589]\\d{9}|(?:TAM|A)\\d{5}|E\\d{7})\\b', x, perl=T))
[[1]]
[1] "3234567890" "5234567890" "9234567890" "4234567890" "8234567890"
[6] "TAM12345" "E1234567" "A12345"

R Conditional Replace/Trim with Fill (regex,gsub,gregexpr,regmatches)

I have a question involving conditional replace.
I essentially want to find every string of numbers and, for every consecutive digit after 4, replace it with a space.
I need the solution to be vectorized and speed is essential.
Here is a working (but inefficient solution):
data <- data.frame(matrix(NA, ncol=2, nrow=6, dimnames=list(c(), c("input","output"))),
stringsAsFactors=FALSE)
data[1,] <- c("STRING WITH 2 FIX(ES): 123456 098765 1111 ",NA)
data[2,] <- c(" PADDED STRING WITH 3 FIX(ES): 123456 098765 111111 ",NA)
data[3,] <- c(" STRING WITH 0 FIX(ES): 12 098 111 ",NA)
data[4,] <- c(NA,NA)
data[5,] <- c("1234567890",NA)
data[6,] <- c(" 12345 67890 ",NA)
x2 <- data[,"input"]
x2
p1 <- "([0-9]+)"
m1 <- gregexpr(p1, x2,perl = TRUE)
nchar1 <- lapply(regmatches(x2, m1), function(x){
if (length(x)==0){ x <- NA } else ( x <- nchar(x))
return(x) })
x3 <- mapply(function(match,length,text,cutoff) {
temp_comb <- data.frame(match=match, length=length, stringsAsFactors=FALSE)
for(i in which(temp_comb[,"length"] > cutoff))
{
before <- substr(text, 1, (temp_comb[i,"match"]-1))
middle_4 <- substr(text, temp_comb[i,"match"], temp_comb[i,"match"]+cutoff-1)
middle_space <- paste(rep(" ", temp_comb[i,"length"]-cutoff),sep="",collapse="")
after <- substr(text, temp_comb[i,"match"]+temp_comb[i,"length"], nchar(text))
text <- paste(before,middle_4,middle_space,after,sep="")
}
return(text)
},match=m1,length=nchar1,text=x2,cutoff=4)
data[,"output"] <- x3
Is there a better way?
I was looking at the help section for regmatches and there was a similar type question, but it was full replacement with blanks and not conditional.
I would have written some alternatives and benchmarked them but honestly I couldn't think of other ways to do this.
Thanks ahead of time for the help!
UPDATE
Fleck,
Using your way but making cutoff an input, I am getting an error for the NA case:
#replace numbers afther the 4th with spaces for those matches
zz<-lapply(regmatches(data$input, m), function(x,cutoff) {
# x <- regmatches(data$input, m)[[4]]
# cutoff <- 4
mapply(function(x, n, cutoff){
formatC(substr(x,1,cutoff), width=-n)
}, x=x, n=nchar(x),cutoff=cutoff)
},cutoff=4)
Here's a fast approach with just one gsub command:
gsub("(?<!\\d)(\\d{4})\\d*", "\\1", data$input, perl = TRUE)
# [1] "STRING WITH 2 FIX(ES): 1234 0987 1111 "
# [2] " PADDED STRING WITH 3 FIX(ES): 1234 0987 1111 "
# [3] " STRING WITH 0 FIX(ES): 12 098 111 "
# [4] NA
# [5] "1234"
# [6] " 1234 6789 "
The string (?<!\\d) is a negative lookahead: A position that is not preceded by a digit. The string (\\d{4}) means 4 consecutive digits. Finally, \\d* represents any number of digits. The part of the string that matches this regex is replaced by the first group (the first 4 digits).
An approach that does not change string length:
matches <- gregexpr("(?<=\\d{4})\\d+", data$input, perl = TRUE)
mapply(function(m, d) {
if (!is.na(m) && m != -1L) {
for (i in seq_along(m)) {
substr(d, m[i], m[i] + attr(m, "match.length") - 1L) <- paste(rep(" ", attr(m, "match.length")[i]), collapse = "")
}
}
return(d)
}, matches, data$input)
# [1] "STRING WITH 2 FIX(ES): 1234 0987 1111 "
# [2] " PADDED STRING WITH 3 FIX(ES): 1234 0987 1111 "
# [3] " STRING WITH 0 FIX(ES): 12 098 111 "
# [4] NA
# [5] "1234 "
# [6] " 1234 6789 "
You can do the same in one line (and one space for one digit) with:
gsub("(?:\\G(?!\\A)|\\d{4})\\K\\d", " ", data$input, perl = TRUE)
details:
(?: # non-capturing group: the two possible entry points
\G # either the position after the last match or the start of the string
(?!\A) # exclude the start of the string position
| # OR
\d{4} # four digits
) # close the non-capturing group
\K # removes all on the left from the match result
\d # a single digit
Here's a way with gregexpr and regmatches
#find all numbers with more than 4 digits
m <- gregexpr("\\d{5,}", data$input)
#replace numbers afther the 4th with spaces for those matches
zz<-lapply(regmatches(data$input, m), function(x) {
mapply(function(x, n) formatC(substr(x,1,4), width=-n), x, nchar(x))
})
#combine with original values
data$output2 <- unlist(Map(function(a,b) paste0(a,c(b,""), collapse=""),
regmatches(data$input, m, invert=T), zz))
The different here is that it turns the NA value into "". We could add in other checks to prevent that or just turn all zero length strings into missing values at the end. I just didn't want to over-complicate the code with safety checks.