R: Substituting character strings coerced from numbers in scientific notation with gsub - regex

I need to export some data to a text file for use in another programming language, where the numbers can't exceed 14 digits. Not all elements need to be comma separated, which is why I use this method.
The problem is that gsub doesn't recognize the number 42 once it is coerced to a character string while the scientific notation option scipen is set low enough that 42 gets printed in E-notation. Here scipen = -10, so 42 is printed in E-notation:
x <- 4.2e+1 # The meaning of life
options(scipen = -10)
gsub(pattern=x,replacement=paste(",",x),x,useBytes=TRUE)
[1] "4.2e+01"
gsub(pattern=x,replacement=paste(",",x),x,useBytes=FALSE)
[1] "4.2e+01"
It is as if gsub doesn't recognize the match. I have also tried
gsub(pattern=x,replacement=paste(",",x),as.character(x))
but with no luck.
In the following two examples gsub behaves as expected, because scipen = 0 is high enough to ensure 42 is printed as 42.
x <- 4.2e+1 # Still the meaning of life
options(scipen = 0)
gsub(pattern=x,replacement=paste(",",x),x,useBytes=TRUE)
[1] ", 42"
gsub(pattern=x,replacement=paste(",",x),x,useBytes=FALSE)
[1] ", 42"
As you can see, the useBytes option doesn't help either. Can someone please tell me what I am not getting?
Thanks.

The characters . and + are regex metacharacters, so they are not interpreted literally. You have to escape them in your pattern (with \\); then it will work.
x <- 4.2e+1 # The meaning of life
options(scipen = -10)
x_pat <- gsub("(\\+|\\.)", "\\\\\\1", x)
# [1] "4\\.2e\\+01"
gsub(x_pat, paste(",", x), x)
# [1] ", 4.2e+01"
Another possibility is to use the argument fixed = TRUE. This matches the pattern string as is.
gsub(x, paste(",", x), x, fixed = TRUE)
# [1] ", 4.2e+01"

Related

r ngram extraction with regex

Karl Broman's post: https://kbroman.wordpress.com/2015/06/22/randomized-hobbit-2/ got me playing with regex and ngrams just for fun. I attempted to use regex to extract 2-grams. I know there are parsers to do this but am interested in the regex logic (i.e., it was a self challenge that I failed to meet).
Below I give a minimal example and the desired output. The problem in my attempt is two-fold:
1) The grams (words) get eaten up and aren't available for the next pass. How can I make them available for the second pass? (e.g., I want "like" to be available for "like toast" after it has already been consumed in "I like".)
2) I couldn't make the space between words non-captured (notice the trailing white space in my output even though I used (?:\\s*)). How can I avoid capturing trailing spaces on the nth (in this case second) word? I know this could be done simply with "(\\b[A-Za-z']+\\s)(\\b[A-Za-z']+)" for a 2-gram, but I want to extend the solution to n-grams. PS: I know about \\w, but I don't consider underscores and numbers word parts, though I do consider ' a word part.
MWE:
library(stringi)
x <- "I like toast and jam."
stringi::stri_extract_all_regex(
  x,
  pattern = "((\\b[A-Za-z']+\\b)(?:\\s*)){2}"
)
## [[1]]
## [1] "I like " "toast and "
Desired Output:
## [[1]]
## [1] "I like" "like toast" "toast and" "and jam"
Here's one way using base R regex. This can be easily extended to handle arbitrary n-grams (a generalization is sketched after the example below). The trick is to put the capture group inside a positive look-ahead assertion, e.g., (?=(my_overlapping_pattern)).
x <- "I like toast and jam."
pattern <- "(?=(\\b[A-Za-z']+\\b \\b[A-Za-z']+\\b))"
matches <- gregexpr(pattern, x, perl = TRUE)
# a little post-processing needed to get the capture groups with regmatches
attr(matches[[1]], 'match.length') <- as.vector(attr(matches[[1]], 'capture.length')[,1])
regmatches(x, matches)
# [[1]]
# [1] "I like" "like toast" "toast and" "and jam"
Actually, there is an app for that: the quanteda package (for the quantitative analysis of textual data). My coauthor Paul Nulty and I are working hard to improve this, but it easily handles the use case you describe.
install.packages("quanteda")
require(quanteda)
x <- "I like toast and jam."
ngrams(x, 2)
## [[1]]
## [1] "i_like" "like_toast" "toast_and" "and_jam"
ngrams(x, n = 2, concatenator = " ", toLower = FALSE)
## [[1]]
## [1] "I like" "like toast" "toast and" "and jam"
No painful regexes required!
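Note that in newer quanteda releases ngrams() has been folded into the tokens API; assuming a version that provides tokens_ngrams(), the equivalent would be:
toks <- tokens("I like toast and jam.", remove_punct = TRUE)
as.list(tokens_ngrams(toks, n = 2, concatenator = " "))
## $text1
## [1] "I like"     "like toast" "toast and"  "and jam"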

conditional string splitting in R (using tidyr)

I have a data frame like this:
X <- data.frame(value = c(1,2,3,4),
                variable = c("cost", "cost", "reed_cost", "reed_cost"))
I'd like to split the variable column into two: one column to indicate whether the variable is a "cost" and another to indicate whether or not the variable is "reed". I cannot seem to figure out the right regex for the split (e.g., using tidyr).
If my data were something nicer, say:
Y <- data.frame(value = c(1,2,3,4),
                variable = c("adjusted_cost", "adjusted_cost", "reed_cost", "reed_cost"))
Then this is trivial with tidyr:
separate(Y, variable, c("Type", "Model"), "_")
and bingo. Instead, it looks like I need some kind of conditional statement to split on "_" if it is present, and otherwise split at the start of the string ("^").
I tried:
separate(X, variable, c("Policy-cost", "Reed"), "(?(_)_|^)", perl=TRUE)
but no luck. I realize I cannot even split to an empty string successfully:
separate(X, variable, c("Policy-cost", "Reed"), "^", perl=TRUE)
how should I do this?
Edit Note that this is a minimal example of a larger problem, in which there are many possible variables (not just cost and reed_cost) so I do not want to string match each one.
I am looking for a solution that splits arbitrary variables by the _ pattern if present and otherwise splits them into a blank string and the original label.
I also realize I could just grep for the presence of _ and then construct the columns manually. That's fine, if rather less elegant; it seems there should be a way to split on a string using a conditional that can return an empty string...
Assuming you may or may not have a separator and that cost and reed aren't necessarily mutually exclusive, why not search for the specific string instead of the separator?
Example:
library(stringr)
X <- data.frame(value = c(1,2,3,4),
                variable = c("cost", "cost", "reed_cost", "reed_cost"))
X$cost <- str_detect(X$variable,"cost")
X$reed <- str_detect(X$variable,"reed")
You could try:
X$variable <- ifelse(!grepl("_", X$variable), paste0("_", X$variable), as.character(X$variable))
separate(X, variable, c("Policy-cost", "Reed"), "_")
#  value Policy-cost Reed
#1     1             cost
#2     2             cost
#3     3        reed cost
#4     4        reed cost
Or
X$variable <- gsub("\\b(?=[A-Za-z]+\\b)", "_", X$variable, perl=T)
X$variable
#[1] "_cost" "_cost" "reed_cost" "reed_cost"
separate(X, variable, c("Policy-cost", "Reed"), "_")
Explanation
\\b(?=[A-Za-z]+\\b) : matches a word boundary \\b and looks ahead for letters followed by a word boundary. The third and fourth elements do not match (inside reed_cost the letters are followed by the underscore, which is itself a word character, so the look-ahead finds no boundary), so they were not replaced.
Another approach with base R:
cbind(X["value"],
setNames(as.data.frame(t(sapply(strsplit(as.character(X$variable), "_"),
function(x)
if (length(x) == 1) c("", x)
else x))),
c("Policy-cost", "Reed")))
#   value Policy-cost Reed
# 1     1             cost
# 2     2             cost
# 3     3        reed cost
# 4     4        reed cost
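Alternatively, depending on your tidyr version, separate() can handle this directly: its fill argument fills the missing piece with NA instead of warning. A minimal sketch, assuming a tidyr recent enough to have fill:
library(tidyr)
X <- data.frame(value = c(1,2,3,4),
                variable = c("cost", "cost", "reed_cost", "reed_cost"))
## fill = "left" pads the missing left-hand piece with NA
separate(X, variable, c("Policy-cost", "Reed"), sep = "_", fill = "left")
#   value Policy-cost Reed
# 1     1        <NA> cost
# 2     2        <NA> cost
# 3     3        reed cost
# 4     4        reed cost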

How to detect a string in all caps and convert it to start case

Say I have the following vector
x <- c('One', 'TWO', 'THREE / FOUR')
I want to convert TWO and THREE / FOUR to Two and Three / Four, respectively. I've taken a look into casefold() and the whole chartr() help page but couldn't figure this out.
In my real problem, I have a vector of 1500 strings in which I intend to detect entries written in all caps (I know many of them include a slash just like the one in the example above) and convert them to start case.
One thing I can do is run grepl('^[A-Z]+$', x) (as suggested by tenub), but it doesn't detect THREE / FOUR as being all caps (it yields [1] FALSE TRUE FALSE). From what I've seen, the mere presence of a space is enough to make this return FALSE.
Removing the anchor grepl('[A-Z]+$', x) (as suggested by TheGreatCO) works for the example above, but fails in the next:
y <- "Imposto Territorial Rural - ITR"
grepl('[A-Z]+', y)
[1] TRUE
Moreover, elements containing accents are always left out, no matter what I try:
z <- c('Á')
grepl('[A-Z]+', z)
[1] FALSE
Part of this is a demo example in the gsubfn package. After installing the package, you can run it with demo("gsubfn-lower", package = "gsubfn").
x <- c('One', 'TWO', 'THREE / FOUR', 'ÁÁÁ')
library(gsubfn)
## find indices of vector where there are no lowercase letters
## (therefore all letters must be uppercase)
idx <- grep("[[:lower:]]", x, invert = TRUE)
## in these indices, run tolower on characters
## that do not follow a word boundary \\B
x[idx] <- gsubfn("\\B.", tolower, x[idx], perl = TRUE)
# [1] "One" "Two" "Three / Four" "Ááá"
Both \B and [:lower:] are locale-dependent; see Sys.getlocale("LC_CTYPE"). Mine is "English_United States.1252". Your mileage may vary.
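If the locale classes misbehave, a locale-agnostic alternative is to compare each element against its own case conversions, which flags the accented entries too. A minimal sketch, assuming a UTF-8 locale so that toupper()/tolower() handle the accented characters:
x <- c('One', 'TWO', 'THREE / FOUR', 'ÁÁÁ')
## an element is "all caps" if uppercasing changes nothing
## but lowercasing changes something
idx <- which(x == toupper(x) & x != tolower(x))
## lowercase everything after the first letter of each word
## (\L lowercases the rest of the replacement when perl = TRUE)
x[idx] <- gsub("([[:alpha:]])([[:alpha:]]+)", "\\1\\L\\2", x[idx], perl = TRUE)
x
# [1] "One"          "Two"          "Three / Four" "Ááá"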
I don't know R so well, but I base this answer on the description of gsub and its regular expression support. The replacement argument is a plain string, so calling tolower() on a backreference does not work; with perl = TRUE, however, the \L escape lowercases the rest of the replacement:
gsub("([A-Z])([[:alpha:]]*)", "\\1\\L\\2", x, perl = TRUE)

Regular expressions in R to erase all characters after the first space?

I have data in R that can look like this:
USDZAR Curncy
R157 Govt
SPX Index
In other words, one word, in this case a Bloomberg security identifier, followed by another word, which is the security class, separated by a space. I want to strip out the class and the space to get to:
USDZAR
R157
SPX
What's the most efficient way of doing this in R? Is it regular expressions, or must I do something as I would in MS Excel, using the MID and FIND functions? E.g., in Excel I would say:
=MID(#REF, 1, FIND(" ", #REF, 1)-1)
which means return a substring starting at character 1, and ending at the character number of the first space (less 1 to erase the actual space).
Do I need to do something similar in R (in which case, what is the equivalent), or can regular expressions help here? Thanks.
1) Try this, where the regular expression matches a space followed by any sequence of characters, and sub replaces that with a zero-length string:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
sub(" .*", "", x)
## [1] "USDZAR" "R157" "SPX"
2) An alternative, if you wanted the two words in separate columns of a data frame, is as follows. Here as.is = TRUE makes the columns character rather than factor.
read.table(text = x, as.is = TRUE)
## V1 V2
## 1 USDZAR Curncy
## 2 R157 Govt
## 3 SPX Index
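For a quite literal translation of the Excel formula, regexpr() plays the role of FIND and substr() the role of MID. A minimal sketch:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
## position of the first space, minus one, marks the end of the first word
substr(x, 1, regexpr(" ", x, fixed = TRUE) - 1)
## [1] "USDZAR" "R157"   "SPX"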
It's pretty easy with stringr:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
library(stringr)
str_split_fixed(x, " ", n = 2)[, 1]
If you're like me and regexps will always remain an inscrutable, frustrating mystery, this clunkier solution also exists:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
unlist(lapply(strsplit(x," ",fixed=TRUE),"[",1))
The fixed = TRUE isn't strictly necessary; it's just pointing out that you can do this (in the simple case) without really knowing the first thing about regexps.
Edited to reflect #Wojciech's comment.
The regex would be to search for:
\x20.*
and replace with an empty string.
If you want to know whether it's faster, just time it.
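A rough base-R timing comparison could look like this (a sketch; the vector size is arbitrary):
x_big <- rep(c("USDZAR Curncy", "R157 Govt", "SPX Index"), 1e5)
## "\x20" is simply an escape for the space character
system.time(sub("\x20.*", "", x_big))
system.time(unlist(lapply(strsplit(x_big, " ", fixed = TRUE), "[", 1)))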

Remove from a string all except selected characters

I want to remove from a string all characters that are not digits, minus signs, or decimal points.
I imported data from Excel using read.xls, and it included some strange characters. I need to convert these to numeric. I am not too familiar with regular expressions, so I need a simpler way to do the following:
excel_coords <- c(" 19.53380Ý°", " 20.02591°", "-155.91059°", "-155.8154°")
unwanted <- unique(unlist(strsplit(gsub("[0-9]|\\.|-", "", excel_coords), "")))
clean_coords <- gsub(do.call("paste", args = c(as.list(unwanted), sep = "|")),
                     replacement = "", x = excel_coords)
> clean_coords
[1] "19.53380" "20.02591" "-155.91059" "-155.8154"
Bonus if somebody can tell me why these characters have appeared in some of my data (the degree signs are part of the original Excel worksheet, but the others are not).
Short and sweet, thanks to a comment by G. Grothendieck:
gsub("[^-.0-9]", "", excel_coords)
From http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html: "A character class is a list of characters enclosed between [ and ] which matches any single character in that list; unless the first character of the list is the caret ^, when it matches any character not in the list."
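Since the end goal is numeric data, the cleaned strings can be passed straight to as.numeric():
as.numeric(gsub("[^-.0-9]", "", excel_coords))
# [1]   19.53380   20.02591 -155.91059 -155.81540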
This can also be done using strsplit, sapply and paste, indexing the correct characters rather than the wrong ones:
excel_coords <- c(" 19.53380Ý°", " 20.02591°", "-155.91059°", "-155.8154°")
correct_chars <- c(0:9, "-", ".")
sapply(strsplit(excel_coords, ""),
       function(x) paste(x[x %in% correct_chars], collapse = ""))
[1] "19.53380" "20.02591" "-155.91059" "-155.8154"
gsub("(.+)([[:digit:]]+\\.[[:digit:]]+)(.+)", "\\2", excel_coords)
[1] "9.53380" "0.02591" "5.91059" "5.8154"