I would like to format numbers so that each group of three digits (thousands) is separated by a space.
What I've tried:
library(magrittr)
addSpaceSep <- function(x) {
x %>%
as.character %>%
strsplit(split = NULL) %>%
unlist %>%
rev %>%
split(ceiling(seq_along(.) / 3)) %>%
lapply(paste, collapse = "") %>%
paste(collapse = " ") %>%
strsplit(split = NULL) %>%
unlist %>%
rev %>%
paste(collapse = "")
}
> sapply(c(1, 12, 123, 1234, 12345, 123456, 123456, 1234567), addSpaceSep)
[1] "1" "12" "123" "1 234" "12 345" "123 456" "123 456"
[8] "1 234 567"
> sapply(c(1, 10, 100, 1000, 10000, 100000, 1000000), addSpaceSep)
[1] "1" "10" "100" "1 000" "10 000" "1e +05" "1e +06"
I feel bad about having written this makeshift function, but as I haven't mastered regular expressions, it's the only way I found to do it. And of course it doesn't work when the number is converted to scientific notation.
This seems like a much better fit for the format() function than for regular expressions. format() exists to format numbers:
format(c(1, 12, 123, 1234, 12345, 123456, 123456, 1234567), big.mark=" ", trim=TRUE)
# [1] "1" "12" "123" "1 234" "12 345" "123 456"
# [7] "123 456" "1 234 567"
format(c(1, 10, 100, 1000, 10000, 100000, 1000000), big.mark=" ", scientific=FALSE, trim=TRUE)
# [1] "1" "10" "100" "1 000" "10 000" "100 000"
# [7] "1 000 000"
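If this is needed in several places, the format() call can be wrapped in a small helper. This is only a minimal sketch, assuming whole-number input; the name add_space_sep is made up for illustration and does not come from the question.

```r
# Hypothetical helper (name is an assumption): wraps format() so that
# scientific notation is switched off and padding is trimmed.
add_space_sep <- function(x) {
  format(x, big.mark = " ", scientific = FALSE, trim = TRUE)
}

add_space_sep(c(1, 1000, 100000, 1000000))
# [1] "1"         "1 000"     "100 000"   "1 000 000"
```

Because format() is already vectorized, no sapply() is needed around the helper.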
x<-100000000
prettyNum(x,big.mark=" ",scientific=FALSE)
[1] "100 000 000"
I agree with the other answers that using other tools (such as format) is the best approach. But if you really want to use a regular expression and substitution, then here is an approach that works using Perl's look ahead.
> test <- c(1, 12, 123, 1234, 12345, 123456, 1234567, 12345678)
>
> gsub('(\\d)(?=(\\d{3})+(\\D|$))', '\\1 ',
+ as.character(test), perl=TRUE)
[1] "1" "12" "123" "1 234"
[5] "12 345" "123 456" "1 234 567" "12 345 678"
Basically it looks for a digit that is followed by 1 or more sets of 3 digits (followed by a non-digit or the end of string) and replaces the digit with itself plus a space (the look ahead part does not appear in the replacement because it is not part of the match, more a condition on the match).
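To see the lookahead at work on something other than bare integers, here is a small sketch using the same pattern. The two input strings are made up for illustration. The second one shows a caveat: the pattern has no notion of a decimal point, so it also groups digits in the fractional part.

```r
# Same pattern as above; the lookahead is a condition, not part of the match
pattern <- '(\\d)(?=(\\d{3})+(\\D|$))'

# Digits embedded in text: the (\\D|$) alternative lets a non-digit end the run
in_text <- gsub(pattern, '\\1 ', "Total: 1234567 units", perl = TRUE)
in_text
# [1] "Total: 1 234 567 units"

# Caveat: digits after a decimal point are grouped as well
with_fraction <- gsub(pattern, '\\1 ', "1234.5678", perl = TRUE)
with_fraction
# [1] "1 234.5 678"
```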
I have a dataframe mydf:
Content term
1 Search Term: abc| NA
2 Search Term-xyz NA
3 Search Term-pqr| NA
Made a regex:
\Search Term[:]?.?([a-zA-Z]+)\
to get terms like abc xyz and pqr.
How do I extract these terms into the term column? I tried str_match and gsub, but did not get the correct results.
We can try with sub
sub(".*(\\s+|-)", "", df1$Content)
#[1] "abc" "xyz" "pqr"
Or
library(stringr)
str_extract(df1$Content, "\\w+$")
#[1] "abc" "xyz" "pqr"
Update
If the | is also found in the string at the end
gsub(".*(\\s+|-)|[^a-z]+$", "", df1$Content)
#[1] "abc" "xyz" "pqr"
Or
str_extract(df1$Content, "\\w+(?=(|[|])$)")
#[1] "abc" "xyz" "pqr"
Just to demonstrate the word function of stringr,
library(stringr)
df$term <- gsub('.*-', '', word(df$Content, -1))
gsub('[[:punct:]]', '', df$term)
#[1] "abc" "xyz" "pqr"
'gsub' will help you
content <- c("Search Term: abc|", "Search Term-xyz", "Search Term-pqr|")
term <- c(NA, NA, NA)
test123 <- as.data.frame(cbind(content, term))
test123$term <- as.character(gsub(".*(\\s+|-)|[^a-z]+$", "", test123$content))
test123
content term
1 Search Term: abc| abc
2 Search Term-xyz xyz
3 Search Term-pqr| pqr
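Another option, closer to the capture-group regex in the question, is to match the whole string and keep only the group. This is a sketch under the assumption that every value starts with "Search Term" followed by an optional ":" or "-"; strings that do not match are returned unchanged by sub().

```r
content <- c("Search Term: abc|", "Search Term-xyz", "Search Term-pqr|")

# Match the full string, capture the letters after the prefix,
# and replace everything with the captured group
term <- sub("^Search Term[-:]?\\s*([A-Za-z]+).*$", "\\1", content)
term
# [1] "abc" "xyz" "pqr"
```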
I've a CSV file. It contains the output of some previous R operations, so it is filled with the index numbers (such as [1], [[1]]). When it is read into R, it looks like this, for example:
V1
1 [1] 789
2 [[1]]
3 [1] "PNG" "D115" "DX06" "Slz"
4 [1] 787
5 [[1]]
6 [1] "D010" "HC"
7 [1] 949
8 [[1]]
9 [1] "HC" "DX06"
(I don't know why there is all that wasted space between the line number and the output data)
I need the above data to appear as follows (without [1] or [[1]] or " " and with the data placed beside its corresponding number, like):
789 PNG,D115,DX06,Slz
787 D010,HC
949 HC,DX06
(possibly the 789 and its corresponding data PNG,D115,DX06,Slz should be separated by a tab.. and like that for each row)
How to achieve this in R?
We could create a grouping variable ('indx') and split the 'V1' column by it, after removing the bracketed index at the beginning of each element as well as the double quotes within the strings. Assuming that the first column should hold the numeric element and the second column the non-numeric part, we can use a regex to replace the spaces with , (as shown in the expected result) and then rbind the list elements.
indx <- cumsum(c(grepl('\\[\\[', df1$V1)[-1], FALSE))
do.call(rbind,lapply(split(gsub('"|^.*\\]', '', df1$V1), indx),
function(x) data.frame(ind=x[1],
val=gsub('\\s+', ',', gsub('^\\s+|\\s+$', '',x[-1][x[-1]!=''])))))
# ind val
#1 789 PNG,D115,DX06,Slz
#2 787 D010,HC
#3 949 HC,DX06
data
df1 <- structure(list(V1 = c("[1] 789", "[[1]]",
"[1] \"PNG\" \"D115\" \"DX06\" \"Slz\"",
"[1] 787", "[[1]]", "[1] \"D010\" \"HC\"", "[1] 949",
"[[1]]", "[1] \"HC\" \"DX06\"")), .Names = "V1",
class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6",
"7", "8", "9"))
Honestly, a command-line fix using either sed/perl/egrep -o is less pain:
sed -e 's/.*\][ \t]*//' dirty.csv > clean.csv
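Staying in R, the cleanup from the first answer can also be written as a sequence of simpler steps. This is a rough sketch that assumes the lines strictly alternate between a number, a [[1]] marker and a quoted vector, as in the sample; the variable names are made up for illustration.

```r
v1 <- c('[1] 789', '[[1]]', '[1] "PNG" "D115" "DX06" "Slz"',
        '[1] 787', '[[1]]', '[1] "D010" "HC"',
        '[1] 949', '[[1]]', '[1] "HC" "DX06"')

# Drop double quotes and the leading [1] / [[1]] index, then trim whitespace
cleaned <- trimws(gsub('"|^\\[+[0-9]+\\]+', '', v1))
# The [[1]] lines are now empty; remove them
cleaned <- cleaned[nzchar(cleaned)]

# What remains alternates: number, values, number, values, ...
ids  <- cleaned[c(TRUE, FALSE)]
vals <- gsub("\\s+", ",", cleaned[c(FALSE, TRUE)])

# One tab-separated line per record
out_lines <- paste(ids, vals, sep = "\t")
writeLines(out_lines)
```

If the alternation assumption does not hold for the real file, the grouping-index approach from the first answer is the safer choice.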
a <- "1 \"US\", 2 \"UK\", 3 \"GE\""
I hope to get the following results:
1 US
2 UK
3 GE
Seems like you want something like this,
> a <- c("1 \"US\"","2 \"UK\"","3 \"GE\"")
> gsub("\"", "", a)
[1] "1 US" "2 UK" "3 GE"
OR
> a <- "1 \"US\", 2 \"UK\", 3 \"GE\""
> gsub("\"", "", a)
[1] "1 US, 2 UK, 3 GE"
> gsub("\"|,", "", a)
[1] "1 US 2 UK 3 GE"
\" is the escape sequence for a double quote inside a double-quoted string.
There are no backslashes in your string (a character vector of length 1).
> cat(a)
1 "US", 2 "UK", 3 "GE"
The backslashes you see in the printed value are there to escape the double quotes, which would otherwise close the string. Compare what it looks like if you use single quotes to delimit the string (in which case a double quote does not close it):
> identical(a, '1 "US", 2 "UK", 3 "GE"')
[1] TRUE
If you want to remove the commas,
> gsub(",", "", a)
[1] "1 \"US\" 2 \"UK\" 3 \"GE\""
If you want to display it without having it printed as a delimited string and without escaping things in it, use cat. You can even do both.
> cat(gsub(",", "", a))
1 "US" 2 "UK" 3 "GE"
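To get the three separate lines shown in the question, the two ideas above can be combined: remove the quotes, split on the commas, and print without escapes. A minimal sketch, assuming the items are always separated by a comma and optional whitespace:

```r
a <- "1 \"US\", 2 \"UK\", 3 \"GE\""

# The backslashes are not stored: the string has 22 characters
n_chars <- nchar(a)

# Remove the double quotes, then split into one element per item
parts <- strsplit(gsub("\"", "", a), ",\\s*")[[1]]

# writeLines prints each element on its own line, unquoted
writeLines(parts)
```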
I have a question involving conditional replace.
I essentially want to find every string of numbers and, for every consecutive digit after 4, replace it with a space.
I need the solution to be vectorized and speed is essential.
Here is a working (but inefficient) solution:
data <- data.frame(matrix(NA, ncol=2, nrow=6, dimnames=list(c(), c("input","output"))),
stringsAsFactors=FALSE)
data[1,] <- c("STRING WITH 2 FIX(ES): 123456 098765 1111 ",NA)
data[2,] <- c(" PADDED STRING WITH 3 FIX(ES): 123456 098765 111111 ",NA)
data[3,] <- c(" STRING WITH 0 FIX(ES): 12 098 111 ",NA)
data[4,] <- c(NA,NA)
data[5,] <- c("1234567890",NA)
data[6,] <- c(" 12345 67890 ",NA)
x2 <- data[,"input"]
x2
p1 <- "([0-9]+)"
m1 <- gregexpr(p1, x2,perl = TRUE)
nchar1 <- lapply(regmatches(x2, m1), function(x){
if (length(x)==0){ x <- NA } else ( x <- nchar(x))
return(x) })
x3 <- mapply(function(match,length,text,cutoff) {
temp_comb <- data.frame(match=match, length=length, stringsAsFactors=FALSE)
for(i in which(temp_comb[,"length"] > cutoff))
{
before <- substr(text, 1, (temp_comb[i,"match"]-1))
middle_4 <- substr(text, temp_comb[i,"match"], temp_comb[i,"match"]+cutoff-1)
middle_space <- paste(rep(" ", temp_comb[i,"length"]-cutoff),sep="",collapse="")
after <- substr(text, temp_comb[i,"match"]+temp_comb[i,"length"], nchar(text))
text <- paste(before,middle_4,middle_space,after,sep="")
}
return(text)
},match=m1,length=nchar1,text=x2,cutoff=4)
data[,"output"] <- x3
Is there a better way?
I was looking at the help section for regmatches and there was a similar type question, but it was full replacement with blanks and not conditional.
I would have written some alternatives and benchmarked them but honestly I couldn't think of other ways to do this.
Thanks ahead of time for the help!
UPDATE
Fleck,
Using your way but making cutoff an input, I am getting an error for the NA case:
#replace numbers after the 4th with spaces for those matches
zz<-lapply(regmatches(data$input, m), function(x,cutoff) {
# x <- regmatches(data$input, m)[[4]]
# cutoff <- 4
mapply(function(x, n, cutoff){
formatC(substr(x,1,cutoff), width=-n)
}, x=x, n=nchar(x),cutoff=cutoff)
},cutoff=4)
Here's a fast approach with just one gsub command:
gsub("(?<!\\d)(\\d{4})\\d*", "\\1", data$input, perl = TRUE)
# [1] "STRING WITH 2 FIX(ES): 1234 0987 1111 "
# [2] " PADDED STRING WITH 3 FIX(ES): 1234 0987 1111 "
# [3] " STRING WITH 0 FIX(ES): 12 098 111 "
# [4] NA
# [5] "1234"
# [6] " 1234 6789 "
The string (?<!\\d) is a negative lookbehind: it matches a position that is not preceded by a digit. The string (\\d{4}) means 4 consecutive digits. Finally, \\d* represents any number of digits. The part of the string that matches this regex is replaced by the first group (the first 4 digits).
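Since the question asks for the cutoff to be an input, the pattern can be built with sprintf(). This is a sketch only; the function name truncate_digits is an assumption made for illustration. Like the one-liner above, it shortens the string rather than padding it.

```r
# Hypothetical wrapper: keep the first `cutoff` digits of each digit run
truncate_digits <- function(x, cutoff = 4L) {
  # (?<!\d) anchors at the start of a digit run; \d* swallows the rest
  pattern <- sprintf("(?<!\\d)(\\d{%d})\\d*", cutoff)
  gsub(pattern, "\\1", x, perl = TRUE)
}

truncate_digits(c("123456 098765 1111", NA, "1234567890"))
# [1] "1234 0987 1111" NA               "1234"
truncate_digits("123456", cutoff = 2L)
# [1] "12"
```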
An approach that does not change string length:
matches <- gregexpr("(?<=\\d{4})\\d+", data$input, perl = TRUE)
mapply(function(m, d) {
  # m[1] is NA for NA input and -1 when there is no match
  if (!is.na(m[1]) && m[1] != -1L) {
    len <- attr(m, "match.length")
    for (i in seq_along(m)) {
      # overwrite the i-th match with the same number of spaces
      substr(d, m[i], m[i] + len[i] - 1L) <- strrep(" ", len[i])
    }
  }
  return(d)
}, matches, data$input)
# [1] "STRING WITH 2 FIX(ES): 1234 0987 1111 "
# [2] " PADDED STRING WITH 3 FIX(ES): 1234 0987 1111 "
# [3] " STRING WITH 0 FIX(ES): 12 098 111 "
# [4] NA
# [5] "1234 "
# [6] " 1234 6789 "
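The same length-preserving idea can be sketched without an explicit loop by using the replacement form of regmatches(). This is an alternative, not the method above; the function name pad_after_cutoff is made up, and NA elements are skipped explicitly so they stay NA.

```r
# Hypothetical helper: blank out every digit after the first `cutoff`
# digits of a run, keeping the string length unchanged
pad_after_cutoff <- function(x, cutoff = 4L) {
  ok <- !is.na(x)
  pattern <- sprintf("(?<=\\d{%d})\\d+", cutoff)
  m <- gregexpr(pattern, x[ok], perl = TRUE)
  # one run of spaces per match, as long as the match itself
  blanks <- lapply(regmatches(x[ok], m), function(hit) strrep(" ", nchar(hit)))
  regmatches(x[ok], m) <- blanks
  x
}

res <- pad_after_cutoff(c("123456 098765 1111", NA, "1234567890", "12 098"))
nchar(res)
# [1] 18 NA 10  6
```

On R versions where nchar() reports 2 for NA_character_, the second value prints as 2 instead of NA.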
You can do the same in one line (and one space for one digit) with:
gsub("(?:\\G(?!\\A)|\\d{4})\\K\\d", " ", data$input, perl = TRUE)
details:
(?: # non-capturing group: the two possible entry points
\G # either the position after the last match or the start of the string
(?!\A) # exclude the start of the string position
| # OR
\d{4} # four digits
) # close the non-capturing group
\K # removes all on the left from the match result
\d # a single digit
Here's a way with gregexpr and regmatches
#find all numbers with more than 4 digits
m <- gregexpr("\\d{5,}", data$input)
#replace numbers after the 4th with spaces for those matches
zz<-lapply(regmatches(data$input, m), function(x) {
mapply(function(x, n) formatC(substr(x,1,4), width=-n), x, nchar(x))
})
#combine with original values
data$output2 <- unlist(Map(function(a,b) paste0(a,c(b,""), collapse=""),
regmatches(data$input, m, invert=TRUE), zz))
The difference here is that it turns the NA value into "". We could add other checks to prevent that, or just turn all zero-length strings into missing values at the end. I just didn't want to over-complicate the code with safety checks.
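One cheap way to add that safety check is to restore the missing values from the input afterwards. A minimal sketch with made-up vectors standing in for data$input and data$output2; unlike converting every zero-length string to NA, it leaves genuinely empty inputs as "".

```r
# Stand-ins for data$input and data$output2 (illustrative values only)
input  <- c("12345 6", NA, "")
output <- c("1234  6", "", "")

# Put NA back wherever the input was NA; real empty strings are kept
output[is.na(input)] <- NA_character_
output
# [1] "1234  6" NA        ""
```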
I am trying to write a regex that would match and capture the following for me ...
String: 17+18+19+5+21
Numbers to be captured here (separately) are present in the array - [17,18,21].
Please note that the string can be n characters long (following the same \d+ pattern) and the order of these numbers in the string is not fixed.
Thanks in advance
Given this setup:
library(gsubfn)
s <- "17+18+19+5+21"
a <- c(17, 18, 21)
1) Try this:
L <- as.list(c(setNames(a, a), NA))
strapply(s, "\\d+", L, simplify = na.omit)
giving:
[1] 17 18 21
attr(,"na.action")
[1] 3 4
attr(,"class")
[1] "omit"
2) or this:
pat <- paste(a, collapse = "|")
strapplyc(s, pat, simplify = as.numeric)
giving:
[1] 17 18 21
3) or this non-regexp solution
intersect(scan(text = s, what = 0, sep = "+", quiet = TRUE), a)
giving
[1] 17 18 21
ADDED additional solution.
How about simply:
(17|18|21)
It needs to be a global match, so in Perl it would be like this:
$string =~ m/(17|18|21)/g
Example string:
21+18+19+5+21+18+19+17
Matches:
"21", "18", "21", "18", "17"
Working regex example:
http://regex101.com/r/jL8iF7
You can use gregexpr and regmatches:
vec <- "17+18+19+5+21"
a <- c(17, 18, 21)
pattern <- paste0("\\b(", paste(a, collapse = "|"), ")\\b")
# [1] "\\b(17|18|21)\\b"
regmatches(vec, gregexpr(pattern, vec))[[1]]
# [1] "17" "18" "21"
Note that this matches the exact number, i.e., 17 does not match 177.
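To illustrate what the \\b anchors buy you, here is a small comparison on a made-up string in which 17 and 21 also occur as parts of longer numbers. The variable names loose and strict are only for this sketch.

```r
vec <- "177+17+218+21"
a <- c(17, 18, 21)

loose  <- paste(a, collapse = "|")        # no word boundaries
strict <- paste0("\\b(", loose, ")\\b")   # whole numbers only

loose_hits  <- regmatches(vec, gregexpr(loose, vec))[[1]]
strict_hits <- regmatches(vec, gregexpr(strict, vec))[[1]]

loose_hits
# [1] "17" "17" "21" "21"
strict_hits
# [1] "17" "21"
```

Without the boundaries, the pattern picks up the "17" inside 177 and the "21" inside 218.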