Extracting and merging numbers from strings - regex

I have strings with numbers as follows:
972 2 6424979
81|5264627
49-0202-2801986
07.81.48.27.89
0664/3420034
06041 - 8728
and would like to get an output like:
97226424979
815264627
4902022801986
0781482789
06643420034
060418728
I tried using:
as.numeric(gsub("([0-9]+).*$", "\\1", numbers))
but the numbers are separate in the output.

To get your exact output:
# avoid scientific notation
options(scipen = 999)
# find which entries have a leading 0
ind <- which(substring(numbers, 1, 1) == "0")
y <- as.numeric(gsub("\\D", "", numbers))
# as.numeric() drops the leading 0, so paste it back (this turns y into a character vector)
y[ind] <- paste0("0", y[ind])
y
#[1] "97226424979" "815264627" "4902022801986" "0781482789" "06643420034" "060418728"

([0-9]+).*$ captures the first run of digits, up to the first non-digit, in \\1 and discards everything after it. However, you want:
numbers <- readLines(n=6)
972 2 6424979
81|5264627
49-0202-2801986
07.81.48.27.89
0664/3420034
06041 - 8728
as.numeric(gsub("\\D", "", numbers))
This replaces all non-numbers by nothing.
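If the leading zeros matter (as they usually do for phone numbers), one option is to skip as.numeric() and keep the result as character. This is a minimal sketch of that idea; the variable name digits_only is only illustrative, and the input is typed in directly instead of read with readLines():

```r
numbers <- c("972 2 6424979", "81|5264627", "49-0202-2801986",
             "07.81.48.27.89", "0664/3420034", "06041 - 8728")

# Remove every non-digit character; the result stays a character vector,
# so leading zeros are preserved
digits_only <- gsub("\\D", "", numbers)
digits_only
#[1] "97226424979" "815264627" "4902022801986" "0781482789" "06643420034" "060418728"
```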

Related

Matching the first digits in R

Anyone know a good way to match and categorize the first n digits of a number in R?
For example,
123451
123452
123461
123462
In this case, if we match on the first n = 1 to 4 digits, all four numbers fall into the same group. If we match on n = 5 digits, we get 2 groups.
I thought about doing this by making the numeric vector a character vector, splitting it so that each number is an element that can then be truncated to n digits, and matching based on those digits; however, I have a lot of numbers, and it seems there must be a better way to sort only the first n digits of a number in R. Any thoughts?
Thanks!
Here's a vectorised solution that does not involve conversion to character:
nums <- c(123451,
123452,
123461,
123462)
firstDigits <- function(x, n) {
  ndigits <- floor(log10(x)) + 1
  floor(x / 10^(ndigits - n))
}
factor(firstDigits(nums, 4))
## [1] 1234 1234 1234 1234
## Levels: 1234
factor(firstDigits(nums, 5))
## [1] 12345 12345 12346 12346
## Levels: 12345 12346
factor(firstDigits(nums, 6))
## [1] 123451 123452 123461 123462
## Levels: 123451 123452 123461 123462
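To go from the factor to the actual groups, the prefix can probably be used directly as a grouping key. The sketch below repeats firstDigits() from above so that it runs on its own and uses base R's split(); the name groups is only illustrative:

```r
nums <- c(123451, 123452, 123461, 123462)

# Same helper as above: keep the first n digits of each number
firstDigits <- function(x, n) {
  ndigits <- floor(log10(x)) + 1
  floor(x / 10^(ndigits - n))
}

# Group the original numbers by their first 5 digits
groups <- split(nums, firstDigits(nums, 5))
groups
## $`12345`
## [1] 123451 123452
##
## $`12346`
## [1] 123461 123462
```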

Truncate words within each element of a character vector in R

I have a data frame where one column is a character vector and every element in the vector is the full text of a document. I want to truncate words in each element so that maximum word length is 5 characters.
For example:
a <- c(1, 2)
b <- c("Words longer than five characters should be truncated",
"Words shorter than five characters should not be modified")
df <- data.frame("file" = a, "text" = b, stringsAsFactors=FALSE)
head(df)
file text
1 1 Words longer than five characters should be truncated
2 2 Words shorter than five characters should not be modified
And this is what I'm trying to get:
file text
1 1 Words longe than five chara shoul be trunc
2 2 Words short than five chara shoul not be modif
I've tried using strsplit() and strtrim() to modify each word (based in part on split vectors of words by every n words (vectors are in a list)):
x <- unlist(strsplit(df$text, "\\s+"))
y <- strtrim(x, 5)
y
[1] "Words" "longe" "than" "five" "chara" "shoul" "be" "trunc" "Words" "short" "than"
[12] "five" "chara" "shoul" "not" "be" "modif"
But I don't know if that's the right direction, because I ultimately need the words in a data frame associated with the correct row, as shown above.
Is there a way to do this using gsub and regex?
If you're looking to utilize gsub to perform this task:
> df$text <- gsub('(?=\\b\\pL{6,}).{5}\\K\\pL*', '', df$text, perl=T)
> df
# file text
# 1 1 Words longe than five chara shoul be trunc
# 2 2 Words short than five chara shoul not be modif
You were on the right track. For your idea to work, however, you have to do the split/trim/combine for each row separately. Here's a way to do it. I was very verbose on purpose, to make it clear, but you can obviously use fewer lines.
df$text <- sapply(df$text, function(str) {
  str <- unlist(strsplit(str, " "))
  str <- strtrim(str, 5)
  str <- paste(str, collapse = " ")
  str
})
And the output:
> df
file text
1 1 Words longe than five chara shoul be trunc
2 2 Words short than five chara shoul not be modif
The short version is
df$text <- sapply(df$text, function(str) {
  paste(strtrim(unlist(strsplit(str, " ")), 5), collapse = " ")
})
Edit:
I just realized you asked if it is possible to do this using gsub and regex. Even though you don't need those for this, it's still possible, but harder to read:
df$text <- sapply(df$text, function(str) {
  str <- unlist(strsplit(str, " "))
  str <- gsub("(?<=.{5}).+", "", str, perl = TRUE)
  str <- paste(str, collapse = " ")
  str
})
The regex matches anything that appears after 5 characters and replaces those with nothing. perl = TRUE is necessary to enable the regex lookbehind ((?<=.{5})).
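Since gsub() is vectorized, the split and paste steps can likely be skipped by capturing the first five word characters and dropping the rest of the word. This is a sketch of that variant, not one of the answers above; it assumes words consist of \w characters (letters, digits, underscore), and the names text and truncated are only illustrative:

```r
text <- c("Words longer than five characters should be truncated",
          "Words shorter than five characters should not be modified")

# \\b anchors at a word start, (\\w{5}) keeps five characters,
# \\w+ consumes the remainder of any word longer than five
truncated <- gsub("\\b(\\w{5})\\w+", "\\1", text, perl = TRUE)
truncated
# [1] "Words longe than five chara shoul be trunc"
# [2] "Words short than five chara shoul not be modif"
```

Words of five characters or fewer are left alone because \\w+ requires at least one extra character to match.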

Get the numeric characters from alphanumeric string in R?

Possible duplicate: 1 2
I read the above discussions.
How can I get all the numeric characters from an alphanumeric string in R?
My Code:
> y <- c()
> x <- c("wXYz04516", "XYz24060", "AB04512", "wCz04110", "wXYz04514", "wXYz04110")
> for (i in 1:length(x)){
+ y <- c(as.numeric(gsub("[a-zA-Z]", "", x[i])),y)
+ }
> print (y)
[1] 4110 4514 4110 4512 24060 4516
This outputs all the numeric characters, but it fails to keep the leading zero ("0").
The output omits the leading zero ("0") in the case of 4110, 4514, 4110, 4512, and 4516.
How can I keep the leading zero in front of these numbers?
Leading zeroes are not allowed on whole numeric values. So to have the leading zeros, you'll have to leave them as character. You can, however, print them without quotes if you want.
x <- c("wXYz04516", "XYz24060", "AB04512", "wCz04110", "wXYz04514")
gsub("\\D+", "", x)
# [1] "04516" "24060" "04512" "04110" "04514"
as.numeric(gsub("\\D+", "", x))
# [1] 4516 24060 4512 4110 4514
print(gsub("\\D+", "", x), quote = FALSE)
# [1] 04516 24060 04512 04110 04514
So the last one looks like a numeric, but is actually a character.
Side note: gsub() and as.numeric() are both vectorized functions, so there's also no need for a for() loop in this operation.
If you want the leading zeroes, you will need to create a character vector instead of numeric one, so change as.numeric to as.character.
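If the values have already been converted to numbers, the zeros can be restored by padding to a fixed width, provided that width is known. The sketch below assumes every code is exactly 5 digits wide, which holds for this sample but is an assumption about the data; the names as_number and padded are only illustrative:

```r
x <- c("wXYz04516", "XYz24060", "AB04512", "wCz04110", "wXYz04514")

# Converting to integer drops the leading zeros
as_number <- as.integer(gsub("\\D+", "", x))

# Pad back to 5 digits; the result is character again, not numeric
padded <- sprintf("%05d", as_number)
padded
# [1] "04516" "24060" "04512" "04110" "04514"
```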

R: how to find the first digit in a string

string = "ABC3JFD456"
Suppose I have the above string, and I wish to find what the first digit in the string is and store its value. In this case, I would want to store the value 3 (since it's the first-occurring digit in the string). grepl("\\d", string) only returns a logical value, but does not tell me anything about where or what the first digit is. Which regular expression should I use to find the value of the first digit?
Base R
regmatches(string, regexpr("\\d", string))
## [1] "3"
Or using stringi
library(stringi)
stri_extract_first(string, regex = "\\d")
## [1] "3"
Or using stringr
library(stringr)
str_extract(string, "\\d")
## [1] "3"
1) sub Try sub with the indicated regular expression which takes the shortest string until a digit, a digit and then everything following and replaces it with the digit:
sub(".*?(\\d).*", "\\1", string)
giving:
[1] "3"
This also works if string is a vector of strings.
2) strapplyc It would also be possible to use strapplyc from gsubfn in which case an even simpler regular expression could be used:
strapplyc(string, "\\d", simplify = TRUE)[1]
giving the same result. Alternatively, use this, which gives the same answer again but also works if string is a vector of strings:
sapply(strapplyc(string, "\\d"), "[[", 1)
Get the locations of the digits
tmp <- gregexpr("[0-9]", string)
iloc <- unlist(tmp)[1]
Extract the first digit
as.numeric(substr(string,iloc,iloc))
Using regexpr is simpler
tmp <- regexpr("[0-9]", string)
# regexpr() returns -1 when there is no match
if (tmp[1] != -1) {
  iloc <- tmp[1]
  num <- as.numeric(substr(string, iloc, iloc))
}
Using rex may make this type of task a little simpler.
library(rex)
string = c("ABC3JFD456", "ARST4DS324")
re_matches(string,
  rex(
    capture(name = "first_number", digit)
  )
)
#> first_number
#> 1 3
#> 2 4
> which( sapply( strsplit(string, ""), grepl, patt="[[:digit:]]"))[1]
[1] 4
Or
> gregexpr("[[:digit:]]", string)[[1]][1]
[1] 4
So:
> splstr <- strsplit(string, "")
> splstr[[1]][ which( sapply( splstr, grepl, patt="[[:digit:]]"))[1] ]
[1] "3"
Note that a full result from a gregexpr call is a list, hence the need to extract its first element with "[[":
> gregexpr("[[:digit:]]", string)
[[1]]
[1] 4 8 9 10
attr(,"match.length")
[1] 1 1 1 1
attr(,"useBytes")
[1] TRUE
A gsub solution that is based on replacing the substrings preceding and following the first digit with the empty string:
gsub("^\\D*(?=\\d)|(?<=\\d).*", "", string, perl = TRUE)
# [1] "3"
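One thing to watch with a vector of strings is that regmatches() silently drops elements that contain no digit, so the result can end up shorter than the input. The following sketch keeps the positions aligned by filling non-matches with NA; the sample strings and the names m, found and first_digit are made up for illustration:

```r
strings <- c("ABC3JFD456", "no digits here", "7XYZ")

# Position of the first digit in each string, -1 where there is none
m <- regexpr("[0-9]", strings)
found <- as.vector(m) != -1

# Start with NA everywhere, then fill in the strings that matched
first_digit <- rep(NA_character_, length(strings))
first_digit[found] <- regmatches(strings, m)
first_digit
# [1] "3" NA  "7"
```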

strsplit inconsistent with gregexpr

A regex suggested in a comment on my answer to this question should give the desired result with strsplit, but it does not, even though it seems to correctly match the first and last commas in a character vector. This can be shown using gregexpr and regmatches.
So why does strsplit split on each comma in this example, even though regmatches only returns two matches for the same regex?
# We would like to split on the first comma and
# the last comma (positions 4 and 13 in this string)
x <- "123,34,56,78,90"
# Splits on every comma. Must be wrong.
strsplit( x , '^\\w+\\K,|,(?=\\w+$)' , perl = TRUE )[[1]]
#[1] "123" "34" "56" "78" "90"
# Ok. Let's check the positions of matches for this regex
m <- gregexpr( '^\\w+\\K,|,(?=\\w+$)' , x , perl = TRUE )
# Matching positions are at
unlist(m)
[1] 4 13
# And extracting them...
regmatches( x , m )
[[1]]
[1] "," ","
Huh?! What is going on?
@Aprillion's theory is correct. From the R documentation:
The algorithm applied to each input string is
repeat {
if the string is empty
break.
if there is a match
add the string to the left of the match to the output.
remove the match and all to the left of it.
else
add the string to the output.
break.
}
In other words, at each iteration ^ matches the beginning of a new, shorter string (the input with the previous items already removed).
To simply illustrate this behavior:
> x <- "12345"
> strsplit( x , "^." , perl = TRUE )
[[1]]
[1] "" "" "" "" ""
Here, you can see the consequence of this behavior with a lookahead assertion as delimiter (thanks to @JoshO'Brien for the link).
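The documented loop can be imitated in a few lines of R to see why every comma ends up being a split point. This is only a rough sketch of the algorithm quoted above, not the actual implementation of strsplit; the function name split_sketch is made up, and zero-length matches are simply treated as "no match" here:

```r
split_sketch <- function(s, pattern) {
  out <- character(0)
  repeat {
    if (nchar(s) == 0) break
    m <- regexpr(pattern, s, perl = TRUE)
    len <- attr(m, "match.length")
    if (m != -1 && len > 0) {
      # add the string to the left of the match to the output
      out <- c(out, substr(s, 1, m - 1))
      # remove the match and everything to the left of it
      s <- substr(s, m + len, nchar(s))
    } else {
      out <- c(out, s)
      break
    }
  }
  out
}

x <- "123,34,56,78,90"
split_sketch(x, "^\\w+\\K,|,(?=\\w+$)")
#[1] "123" "34" "56" "78" "90"
```

After "123," is removed, the remaining string is "34,56,78,90", so ^\\w+\\K, matches again at its new beginning, and so on for each piece.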