How to remove the symbol "\" in a string?

a <- "1 \"US\", 2 \"UK\", 3 \"GE\""
I hope to get the following results:
1 US
2 UK
3 GE

Seems like you want something like this,
> a <- c("1 \"US\"","2 \"UK\"","3 \"GE\"")
> gsub("\"", "", a)
[1] "1 US" "2 UK" "3 GE"
OR
> a <- "1 \"US\", 2 \"UK\", 3 \"GE\""
> gsub("\"", "", a)
[1] "1 US, 2 UK, 3 GE"
> gsub("\"|,", "", a)
[1] "1 US 2 UK 3 GE"
\" is the escape sequence for a double quote inside a double-quoted string; it is a single character, not a backslash followed by a quote.

There are no backslashes in your string (a character vector of length 1).
> cat(a)
1 "US", 2 "UK", 3 "GE"
The backslashes you see are there to escape the double quotes, which would otherwise close the string. Compare what it looks like if you use single quotes to delimit the string (in which case a double quote does not close it):
> identical(a, '1 "US", 2 "UK", 3 "GE"')
[1] TRUE
If you want to remove the commas,
> gsub(",", "", a)
[1] "1 \"US\" 2 \"UK\" 3 \"GE\""
If you want to display it without having it printed as a delimited string and without escaping things in it, use cat. You can even do both.
> cat(gsub(",", "", a))
1 "US" 2 "UK" 3 "GE"
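To check this for yourself and get the one-item-per-line output from the question, here is a minimal sketch. It assumes the items are always separated by a comma followed by a space; the names has_backslash and parts are only for illustration.

```r
a <- "1 \"US\", 2 \"UK\", 3 \"GE\""

# The string contains no backslash characters at all
has_backslash <- grepl("\\", a, fixed = TRUE)

# Drop the double quotes, then split on ", " (assumed separator)
parts <- strsplit(gsub("\"", "", a, fixed = TRUE), ", ", fixed = TRUE)[[1]]

# writeLines prints each element on its own line, without quotes
writeLines(parts)
```

Using fixed = TRUE avoids regex interpretation entirely, which is enough here because both the quote and the separator are literal text.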

Related

using toOrdinal to replace numbers with ordinals

I have a sequence of addresses and I am trying to replace numbers with ordinals. Right now I have the following.
library(toOrdinal)
addlist<-c("east 1 street", "4 ave", "5 blvd", "plaza", "43 lane" )
numstringc<-gsub("\\D", "", addlist)
numstring <-as.integer(numstringc)
ordstring<-sapply(numstring[!is.na(numstring)], toOrdinal)
ordstring
[1] "1st" "4th" "5th" "43rd"
I want to eventually get a vector that says
[1] "east 1st street", "4th ave", "5th blvd", "plaza", "43rd lane"
but I can't figure out how to make that.
With \\1 you can refer to the part of the match captured in parentheses, but gsub doesn't accept a function as the replacement. For that you can use gsubfn from the package of the same name, which doesn't even need the \\1 part:
library(gsubfn)
addlist<-c("east 1 street", "4 ave", "5 blvd", "plaza", "43 lane" )
ordstring <- gsubfn("[0-9]+", function (x) toOrdinal(as.integer(x)), addlist)
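Alternatively, you can use gregexpr and regmatches<- to replace the matches in place. regmatches(addlist, m) is a list with one element per string (empty for "plaza"), so convert it element by element:
m <- gregexpr("[0-9]+", addlist)
regmatches(addlist, m) <- lapply(regmatches(addlist, m), function(x) vapply(as.integer(x), toOrdinal, character(1)))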
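If you want to try the regmatches<- approach without installing toOrdinal, here is a self-contained sketch. The helper to_ordinal_en is a hypothetical stand-in written for this example (English suffixes only); it is not part of any package.

```r
# Hypothetical stand-in for toOrdinal(): English ordinal suffixes only
to_ordinal_en <- function(n) {
  n <- as.integer(n)
  suffix <- ifelse(n %% 100L %in% 11:13, "th",
            ifelse(n %% 10L == 1L, "st",
            ifelse(n %% 10L == 2L, "nd",
            ifelse(n %% 10L == 3L, "rd", "th"))))
  paste0(n, suffix)
}

addlist <- c("east 1 street", "4 ave", "5 blvd", "plaza", "43 lane")

# Work on a copy so the original vector stays untouched
out <- addlist
m <- gregexpr("[0-9]+", out)

# One list element per string; strings without digits yield character(0)
regmatches(out, m) <- lapply(regmatches(out, m), to_ordinal_en)
out
```

Because the replacement is built per string with lapply, entries such as "plaza" that contain no number pass through unchanged.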

How to remove the [1]s, [[1]]s and double quotes from a csv data in R?

I've a CSV file. It contains the output of some previous R operations, so it is filled with the index numbers (such as [1], [[1]]). When it is read into R, it looks like this, for example:
V1
1 [1] 789
2 [[1]]
3 [1] "PNG" "D115" "DX06" "Slz"
4 [1] 787
5 [[1]]
6 [1] "D010" "HC"
7 [1] 949
8 [[1]]
9 [1] "HC" "DX06"
(I don't know why all that wasted space between line number and the output data)
I need the above data to appear as follows (without [1] or [[1]] or " " and with the data placed beside its corresponding number, like):
789 PNG,D115,DX06,Slz
787 D010,HC
949 HC,DX06
(possibly the 789 and its corresponding data PNG,D115,DX06,Slz should be separated by a tab.. and like that for each row)
How to achieve this in R?
We could create a grouping variable ('indx') and split the 'V1' column by it, after removing the leading bracketed index ([1] or [[1]]) and the double quotes from each string. Assuming that the first element of each group is the numeric part and the rest is the non-numeric part, we can use a regex to replace the spaces with commas (as shown in the expected result), and then rbind the list elements.
indx <- cumsum(c(grepl('\\[\\[', df1$V1)[-1], FALSE))
do.call(rbind, lapply(split(gsub('"|^.*\\]', '', df1$V1), indx),
    function(x) data.frame(ind = x[1],
        val = gsub('\\s+', ',', gsub('^\\s+|\\s+$', '', x[-1][x[-1] != ''])))))
# ind val
#1 789 PNG,D115,DX06,Slz
#2 787 D010,HC
#3 949 HC,DX06
data
df1 <- structure(list(V1 = c("[1] 789", "[[1]]",
"[1] \"PNG\" \"D115\" \"DX06\" \"Slz\"",
"[1] 787", "[[1]]", "[1] \"D010\" \"HC\"", "[1] 949",
"[[1]]", "[1] \"HC\" \"DX06\"")), .Names = "V1",
class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6",
"7", "8", "9"))
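The same idea can be sketched without split and rbind, under the assumption that the cleaned lines strictly alternate between a number and its values once the [[1]] lines are dropped. The variable names below are illustrative only.

```r
V1 <- c("[1] 789", "[[1]]", "[1] \"PNG\" \"D115\" \"DX06\" \"Slz\"",
        "[1] 787", "[[1]]", "[1] \"D010\" \"HC\"",
        "[1] 949", "[[1]]", "[1] \"HC\" \"DX06\"")

# Remove quotes and everything up to the last "]", then trim whitespace
cleaned <- trimws(gsub('"|^.*\\]', "", V1))

# "[[1]]" lines are now empty; drop them
cleaned <- cleaned[nzchar(cleaned)]

# Assumed layout: odd positions are the numbers, even positions the values
ids  <- cleaned[c(TRUE, FALSE)]
vals <- gsub("\\s+", ",", cleaned[c(FALSE, TRUE)])

# Tab-separated lines, as asked for in the question
result <- paste(ids, vals, sep = "\t")
writeLines(result)
```

If the file ever contains a number without a following values line, the odd/even pairing would silently misalign, so the grouping-index approach above is the safer choice for irregular data.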
Honestly, a command-line fix using either sed/perl/egrep -o is less pain:
sed -e 's/.*\][ \t]*//' dirty.csv > clean.csv

How to separate thousands with space [duplicate]

I would like to format numbers so that every group of thousands is separated by a space.
What I've tried :
library(magrittr)
addSpaceSep <- function(x) {
  x %>%
    as.character %>%
    strsplit(split = NULL) %>%
    unlist %>%
    rev %>%
    split(ceiling(seq_along(.) / 3)) %>%
    lapply(paste, collapse = "") %>%
    paste(collapse = " ") %>%
    strsplit(split = NULL) %>%
    unlist %>%
    rev %>%
    paste(collapse = "")
}
> sapply(c(1, 12, 123, 1234, 12345, 123456, 123456, 1234567), addSpaceSep)
[1] "1" "12" "123" "1 234" "12 345" "123 456" "123 456"
[8] "1 234 567"
> sapply(c(1, 10, 100, 1000, 10000, 100000, 1000000), addSpaceSep)
[1] "1" "10" "100" "1 000" "10 000" "1e +05" "1e +06"
I feel very bad to have written this makeshift function, but as I haven't mastered regular expressions, it's the only way I found to do it. And of course it doesn't work once the number is converted to scientific notation.
This seems like a much better fit for the format() function than for regular expressions. The format() function exists to format numbers.
format(c(1, 12, 123, 1234, 12345, 123456, 123456, 1234567), big.mark=" ", trim=TRUE)
# [1] "1" "12" "123" "1 234" "12 345" "123 456"
# [7] "123 456" "1 234 567"
format(c(1, 10, 100, 1000, 10000, 100000, 1000000), big.mark=" ", scientific=FALSE, trim=TRUE)
# [1] "1" "10" "100" "1 000" "10 000" "100 000"
# [7] "1 000 000"
x<-100000000
prettyNum(x,big.mark=" ",scientific=FALSE)
[1] "100 000 000"
I agree with the other answers that using other tools (such as format) is the best approach. But if you really want to use a regular expression and substitution, then here is an approach that works using Perl's look ahead.
> test <- c(1, 12, 123, 1234, 12345, 123456, 1234567, 12345678)
>
> gsub('(\\d)(?=(\\d{3})+(\\D|$))', '\\1 ',
+ as.character(test), perl=TRUE)
[1] "1" "12" "123" "1 234"
[5] "12 345" "123 456" "1 234 567" "12 345 678"
Basically it looks for a digit that is followed by 1 or more sets of 3 digits (followed by a non-digit or the end of string) and replaces the digit with itself plus a space (the look ahead part does not appear in the replacement because it is not part of the match, more a condition on the match).
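To see the look-ahead approach cope with the scientific-notation problem from the question, here is a small sketch. It assumes whole numbers only, converts them with format(scientific = FALSE) first, and uses a simplified pattern anchored at the end of the string.

```r
x <- c(1, 1000, 100000, 1234567)

# Convert to plain digit strings so 1e+05 never appears
s <- format(x, scientific = FALSE, trim = TRUE)

# A digit followed by one or more complete groups of three digits up to
# the end of the string gets a space appended after it
out <- gsub("(\\d)(?=(\\d{3})+$)", "\\1 ", s, perl = TRUE)
out
```

The look-ahead needs perl = TRUE; the default regex engine in R does not support (?=...).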

R: how to find the first digit in a string

string = "ABC3JFD456"
Suppose I have the above string, and I wish to find what the first digit in the string is and store its value. In this case, I would want to store the value 3 (since it's the first-occurring digit in the string). grepl("\\d", string) only returns a logical value, but does not tell me anything about where or what the first digit is. Which regular expression should I use to find the value of the first digit?
Base R
regmatches(string, regexpr("\\d", string))
## [1] "3"
Or using stringi
library(stringi)
stri_extract_first(string, regex = "\\d")
## [1] "3"
Or using stringr
library(stringr)
str_extract(string, "\\d")
## [1] "3"
1) sub Try sub with the regular expression below, which matches the shortest string up to a digit, the digit itself, and then everything following, and replaces the whole match with the digit:
sub(".*?(\\d).*", "\\1", string)
giving:
[1] "3"
This also works if string is a vector of strings.
2) strapplyc It would also be possible to use strapplyc from the gsubfn package, in which case an even simpler regular expression could be used:
library(gsubfn)
strapplyc(string, "\\d", simplify = TRUE)[1]
This gives the same result. The following gives the same answer again but also works if string is a vector of strings:
sapply(strapplyc(string, "\\d"), "[[", 1)
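One caveat with the sub approach: when a string contains no digit, the pattern does not match and sub returns the string unchanged. A minimal sketch of a guard, assuming NA is the desired result in that case (the third test string is made up for illustration):

```r
strings <- c("ABC3JFD456", "ARST4DS324", "nodigits")

# Without a guard, the digit-free string comes back as-is
raw <- sub(".*?(\\d).*", "\\1", strings, perl = TRUE)

# Keep the result only where a digit actually exists
first_digit <- ifelse(grepl("\\d", strings), raw, NA_character_)
first_digit
```

perl = TRUE is used here only to make the lazy quantifier semantics explicit; the guard itself does not depend on it.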
Get the locations of the digits
tmp <- gregexpr("[0-9]", string)
iloc <- unlist(tmp)[1]
Extract the first digit
as.numeric(substr(string,iloc,iloc))
Using regexpr is simpler
tmp<-regexpr("[0-9]",string)
if(tmp[[1]]>=0) {
iloc <- tmp[1]
num <- as.numeric(substr(string,iloc,iloc))
}
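The regexpr version extends naturally to a vector of strings. The sketch below returns both the position and the value of the first digit, with NA where a string has no digit; the third test string is an assumption added to show that case.

```r
strings <- c("ABC3JFD456", "ARST4DS324", "nodigits")

# regexpr gives the position of the first match, or -1 if there is none
m <- regexpr("[0-9]", strings)

# Strip the match attributes and turn "no match" into NA
pos <- as.integer(m)
pos[pos < 0] <- NA

# substr returns NA when the position is NA, so no if() is needed
first_digit <- as.integer(substr(strings, pos, pos))
```

Here pos holds the locations and first_digit the numeric values, so the if block from the single-string version is replaced by NA propagation.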
Using the rex package may make this type of task a little simpler.
library(rex)
string = c("ABC3JFD456", "ARST4DS324")
re_matches(string,
rex(
capture(name = "first_number", digit)
)
)
#> first_number
#> 1 3
#> 2 4
> which( sapply( strsplit(string, ""), grepl, patt="[[:digit:]]"))[1]
[1] 4
Or
> gregexpr("[[:digit:]]", string)[[1]][1]
[1] 4
So, after splitting the string into single characters:
> splstr <- strsplit(string, "")
> splstr[[1]][ which( sapply( splstr, grepl, patt="[[:digit:]]"))[1] ]
[1] "3"
Note that a full result from a gregexpr call is a list, hence the need to extract its first element with "[[":
> gregexpr("[[:digit:]]", string)
[[1]]
[1] 4 8 9 10
attr(,"match.length")
[1] 1 1 1 1
attr(,"useBytes")
[1] TRUE
A gsub solution that is based on replacing the substrings preceding and following the first digit with the empty string:
gsub("^\\D*(?=\\d)|(?<=\\d).*", "", string, perl = TRUE)
# [1] "3"

How to prevent regmatches drop non matches?

I would like to capture the first match, and return NA if there is no match.
regexpr("a+", c("abc", "def", "cba a", "aa"), perl=TRUE)
# [1] 1 -1 3 1
# attr(,"match.length")
# [1] 1 -1 1 2
x <- c("abc", "def", "cba a", "aa")
m <- regexpr("a+", x, perl=TRUE)
regmatches(x, m)
# [1] "a" "a" "aa"
So I expected "a", NA, "a", "aa"
Staying with regexpr:
r <- regexpr("a+", x)
out <- rep(NA,length(x))
out[r!=-1] <- regmatches(x, r)
out
#[1] "a" NA "a" "aa"
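The regexpr approach can be wrapped in a small reusable function. This is only a sketch: first_match is a hypothetical helper name, and the extra !is.na() check is an assumption about how NA inputs should be treated (they stay NA).

```r
# Hypothetical helper: first match per element, NA where nothing matches
first_match <- function(pattern, x, ...) {
  m <- regexpr(pattern, x, ...)
  out <- rep(NA_character_, length(x))
  hit <- !is.na(m) & m != -1L
  # regmatches() returns only the matched elements, in order
  out[hit] <- regmatches(x, m)
  out
}

x <- c("abc", "def", "cba a", "aa")
res <- first_match("a+", x)
res
```

Starting from NA_character_ rather than plain NA keeps the result a character vector even when nothing matches at all.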
Use regexec instead, since it returns a list, which allows you to catch the character(0)'s before unlisting:
R <- regmatches(x, regexec("a+", x))
unlist({R[sapply(R, length)==0] <- NA; R})
# [1] "a" NA "a" "aa"
Since R 3.3.0, it is possible to pull out both the matched and the non-matched substrings using the invert=NA argument. The help file says:
if invert is NA, regmatches extracts both non-matched and matched substrings, always starting and ending with a non-match (empty if the match occurred at the beginning or the end, respectively).
The output is a list. In the typical case of interest (matching a single pattern with regexpr), each element has length 3 when a match was found and length 1 when there was none.
myMatch <- regmatches(x, m, invert=NA)
myMatch
[[1]]
[1] "" "a" "bc"
[[2]]
[1] "def"
[[3]]
[1] "cb" "a" " a"
[[4]]
[1] "" "aa" ""
So to extract what you want (with "" in place of NA), you can use sapply as follows:
myVec <- sapply(myMatch, function(x) {if(length(x) == 1) "" else x[2]})
myVec
[1] "a" "" "a" "aa"
At this point, if you really want NA instead of "", you can use
is.na(myVec) <- nchar(myVec) == 0L
myVec
[1] "a" NA "a" "aa"
Some revisions:
Note that you can collapse the last two lines into a single line:
myVec <- sapply(myMatch, function(x) {if(length(x) == 1) NA_character_ else x[2]})
The default type of NA is logical, so using it results in an additional type conversion. Using the character version, NA_character_, avoids this.
An even slicker extraction method for the final line is to use [:
sapply(myMatch, `[`, 2)
[1] "a" NA "a" "aa"
So you can do the whole thing in a fairly readable single line:
sapply(regmatches(x, m, invert=NA), `[`, 2)
Using more or less the same construction as yours -
chars <- c("abc", "def", "cba a", "aa")
chars[
regexpr("a+", chars, perl=TRUE) > 0
][1] #abc
chars[
regexpr("q", chars, perl=TRUE) > 0
][1] #NA
#vector[
# find all indices where regexpr returned positive value i.e., match was found
#][return the first element of the above subset]
Edit - Seems like I misunderstood the question. But since two people have found this useful I shall let it stay.
You can use stringr::str_extract(string, pattern). It returns NA if there is no match. It also has a simpler interface than regmatches().