What is the preferred way of determining if a string contains non-Roman/non-English (e.g., ないでさ) characters?
You could use regex/grep to check for hex values of characters outside the range of printable ASCII characters:
x <- 'ないでさ'
grep( "[^\x20-\x7F]",x )
#[1] 1
grep( "[^\x20-\x7F]","Normal text" )
#integer(0)
If you wanted to allow non-printing ("control") characters to be considered "English", you could extend the range of the character class in the first argument to grep to start with "\x01". See ?regex for more information on using character classes, and ?Quotes for more information about how to specify characters as Unicode, hexadecimal, or octal values.
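A hedged sketch of that extension (assuming a UTF-8 locale; the sample strings are mine):

```r
x <- "ないでさ"
# Widen the negated class to start at \x01 so that control
# characters such as tab count as "English" too:
grep("[^\x01-\x7F]", c(x, "Normal text", "tab\there"))
# [1] 1
```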
The R.oo package has conversion functions that may be useful:
library(R.oo)
?intToChar
?charToInt
The fact that Henrik Bengtsson saw fit to include these in his package suggests to me that there is no handy method for this in base/default R. He's a long-time useR/guRu.
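(For completeness: base R's utf8ToInt() and intToUtf8() do perform comparable character/code-point conversions, so a direct check for non-ASCII code points is possible; this is just a sketch of that idea.)

```r
utf8ToInt("A")       # 65
intToUtf8(12394)     # "な" (U+306A)
# Code points above 127 lie outside ASCII:
any(utf8ToInt("ないでさ") > 127)   # TRUE
any(utf8ToInt("Normal") > 127)    # FALSE
```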
Seeing the other answer prompted this effort, which seems straightforward:
> is.na( iconv( c(x, "OrdinaryASCII") , "", "ASCII") )
[1] TRUE FALSE
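If the check is needed repeatedly, the iconv() trick wraps naturally into a small helper (the function name here is mine, not from base R):

```r
# NA after conversion to ASCII means the input had non-ASCII characters
has_non_ascii <- function(s) is.na(iconv(s, from = "", to = "ASCII"))
has_non_ascii(c("ないでさ", "OrdinaryASCII"))
# [1]  TRUE FALSE
```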
You can determine whether a string contains non-Latin/non-ASCII characters with iconv and grep:
# My example, because you didn't add your data
characters <- c("ないでさ, satisfação, катынь, Work, Awareness, Potential, für")
# First you convert string to vector of words
characters.unlist <- unlist(strsplit(characters, split=", "))
# Then find indices of words with non-ASCII characters: iconv() substitutes the
# marker string for anything it cannot convert to ASCII, and grep() then looks
# for that marker
characters.non.ASCII <- grep("characters.unlist", iconv(characters.unlist, "latin1", "ASCII", sub="characters.unlist"))
# subset original vector of words to exclude words with non-ASCII characters
data <- characters.unlist[-characters.non.ASCII]
# convert vector back to a string
dat.1 <- paste(data, collapse = ", ")
# Now if you run
characters.non.ASCII
[1] 1 2 3 7
That means that the first, second, third and seventh elements contain non-ASCII characters; in my case indices 1, 2, 3 and 7 correspond to "ないでさ", "satisfação", "катынь" and "für".
You could also run
dat.1 # and the output will be all ASCII characters
[1] "Work, Awareness, Potential"
I have some incorrect dates between good formatted dates, looking something like this:
df <- data.frame(col=c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05"))
How can I convert the incorrect format between the existing correctly formatted dates?
I'm able to remove the leading dashes, but I also need to remove the last characters, -01 or -1, so that the corrected values are:
desired <- c("1.1.11","1.11.12","1.1.13","1.1.14","1.10.10","1.10.11","1.10.12","2010-03-31","2010-04-01","2010-04-05")
What I'm struggling with is the -01 part, since removing these would also remove part of the correctly formatted dates.
EDIT: The format is mm.dd.yy
Here is a pretty simple solution using sub ...
sub('^-+([^-]+).+', '\\1', df$col)
# [1] "1.1.11" "1.11.12" "1.1.13" "1.1.14" "1.10.10"
# [6] "1.10.11" "1.10.12" "2010-03-31" "2010-04-01" "2010-04-05"
Just remove all the non-word characters at the start, or the -01 or -1 at the end when it is not preceded by a dash and two digits (the (?<!-\\d{2}) lookbehind protects the properly formatted dates).
> x <- c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05")
> gsub("^\\W+|(?<!-\\d{2})-0?1$", "", x, perl=TRUE)
[1] "1.1.11" "1.11.12" "1.1.13" "1.1.14" "1.10.10"
[6] "1.10.11" "1.10.12" "2010-03-31" "2010-04-01" "2010-04-05"
A simple regexp will solve these kinds of problems pretty well:
> df <- c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05")
> df
[1] "--1.1.11-01" "--1.11.12-1" "--1.1.13-01" "--1.1.14-01" "--1.10.10-01" "-1.10.11-01" "---1.10.12-01"
[8] "2010-03-31" "2010-04-01" "2010-04-05"
> df <- sub(".*([0-9]{4}\\-[0-9]{2}\\-[0-9]{2}|[0-9]{1,2}\\.[0-9]{1,2}\\.[0-9]{1,2}).*", "\\1", df)
> df
[1] "1.1.11" "1.11.12" "1.1.13" "1.1.14" "1.10.10" "1.10.11" "1.10.12" "2010-03-31" "2010-04-01"
[10] "2010-04-05"
Note that I made it a character vector instead of data.frame.
The solution itself is just matching one pattern or the other pattern and then dropping the rest by replacing it with the subpattern.
I observe here that an illegal suffix (-01 or -1) appears only when the date has a prefix of -1 or --1, i.e. one or more leading dashes.
You could first collect all such values into a vector: "--1.1.11-01", "--1.11.12-1", "--1.1.13-01", "--1.1.14-01", "--1.10.10-01", "-1.10.11-01".
Then check each entry for the leading-dash prefix; whenever it is present, mark the entry so that the -01/-1 suffix is removed as well.
Given the input pattern above, I believe this strategy would work.
Please let me know if it does.
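A minimal sketch of the strategy described above (the variable names and regexes are mine, checked only against the sample input):

```r
x <- c("--1.1.11-01", "--1.11.12-1", "--1.1.13-01", "--1.1.14-01",
       "--1.10.10-01", "-1.10.11-01", "---1.10.12-01",
       "2010-03-31", "2010-04-01", "2010-04-05")
bad <- grepl("^-", x)               # entries with the dash prefix
x[bad] <- sub("^-+", "", x[bad])    # drop the leading dashes
x[bad] <- sub("-0?1$", "", x[bad])  # drop the trailing -01 or -1
x
# "1.1.11" "1.11.12" "1.1.13" "1.1.14" "1.10.10" "1.10.11" "1.10.12"
# "2010-03-31" "2010-04-01" "2010-04-05"
```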
I need to export some data to a text file for another programming language where the numbers can't exceed 14 digits. Not all elements need to be comma separated, which is why I use this method.
The problem is that gsub doesn't recognize the number 42 when it is coerced to a character string and the scientific-notation option scipen is set low enough that 42 gets printed in E-notation.
Here scipen=-10, so 42 is printed in E-notation.
x <- 4.2e+1 # The meaning of life
options(scipen = -10)
gsub(pattern=x,replacement=paste(",",x),x,useBytes=TRUE)
[1] "4.2e+01"
gsub(pattern=x,replacement=paste(",",x),x,useBytes=FALSE)
[1] "4.2e+01"
It is like gsub doesn't recognize the match. I have also tried
gsub(pattern=x,replacement=paste(",",x),as.character(x))
but with no luck.
In the following two examples gsub acts as expected, and scipen=0 is high enough to ensure 42 is printed as 42.
x <- 4.2e+1 # Still the meaning of life
options(scipen = 0)
gsub(pattern=x,replacement=paste(",",x),x,useBytes=TRUE)
[1] ", 42"
gsub(pattern=x,replacement=paste(",",x),x,useBytes=FALSE)
[1] ", 42"
As you can see, the useBytes option doesn't help either. Can someone please tell me what I am not getting?
Thanks.
The characters . and + are regex metacharacters, so they are not interpreted literally. You have to escape these characters in your pattern (with \\); then it will work.
x <- 4.2e+1 # The meaning of life
options(scipen = -10)
x_pat <- gsub("(\\+|\\.)", "\\\\\\1", x)
# [1] "4\\.2e\\+01"
gsub(x_pat, paste(",", x), x)
# [1] ", 4.2e+01"
Another possibility is to use the argument fixed = TRUE. This matches the pattern string as is.
gsub(x, paste(",", x), x, fixed = TRUE)
# [1] ", 4.2e+01"
Say I have the following vector
x <- c('One', 'TWO', 'THREE / FOUR')
I want to convert TWO and THREE / FOUR to Two and Three / Four, respectively. I've taken a look into casefold() and the whole chartr() help page but couldn't figure this out.
In my real problem, I have a vector of 1500 strings in which I intend to detect entries written in all caps (I know many of them include a slash just like the one in the example above) and convert them to start case.
One thing I can do is run grepl('^[A-Z]+$', x) (as suggested by tenub), but it doesn't detect the THREE / FOUR as being all caps (it yields [1] FALSE TRUE FALSE). From what I've seen, just the presence of a space is enough to have this return FALSE.
Removing the anchor grepl('[A-Z]+$', x) (as suggested by TheGreatCO) works for the example above, but fails in the next:
y <- "Imposto Territorial Rural - ITR"
grepl('[A-Z]+', y)
[1] TRUE
Moreover, elements containing accents are always left out, no matter what I try:
z <- c('Á')
grepl('[A-Z]+', z)
[1] FALSE
Part of this is a demo example in the gsubfn package. After installing the package you can run it with demo("gsubfn-lower", package = "gsubfn").
x <- c('One', 'TWO', 'THREE / FOUR', 'ÁÁÁ')
library(gsubfn)
## find indices of vector where there are no lowercase letters
## (therefore all letters must be uppercase)
idx <- grep("[[:lower:]]", x, invert = TRUE)
## in these indices, run tolower on every character not at a word
## boundary (\\B), i.e. everything but each word's first letter
x[idx] <- gsubfn("\\B.", tolower, x[idx], perl = TRUE)
# [1] "One" "Two" "Three / Four" "Ááá"
Both \B and [:lower:] are locale-dependent; see Sys.getlocale("LC_CTYPE"). Mine is "English_United States.1252". Your mileage may vary.
I don't know R so well, but I base this answer on the description of gsub and the regular-expression support given in this document:
gsub("([A-Z])([[:alpha:]]*)", "\\1\\L\\2", x, perl = TRUE)
Note that base R cannot call a function such as tolower from inside the replacement string; with perl = TRUE, the \L escape lowercases the back-referenced text instead, which achieves the intended effect.
I have data in R that can look like this:
USDZAR Curncy
R157 Govt
SPX Index
In other words, one word, in this case a Bloomberg security identifier, followed by another word, which is the security class, separated by a space. I want to strip out the class and the space to get to:
USDZAR
R157
SPX
What's the most efficient way of doing this in R? Is it regular expressions, or must I do something as I would in MS Excel using the MID and FIND functions? E.g. in Excel I would write:
=MID(#REF, 1, FIND(" ", #REF, 1)-1)
which means return a substring starting at character 1, and ending at the character number of the first space (less 1 to erase the actual space).
Do I need to do something similar in R (in which case, what is the equivalent), or can regular expressions help here? Thanks.
1) Try this where the regular expression matches a space followed by any sequence of characters and sub replaces that with a string having zero characters:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
sub(" .*", "", x)
## [1] "USDZAR" "R157" "SPX"
2) An alternative if you wanted the two words in separate columns in a data frame is as follows. Here as.is = TRUE makes the columns be character rather than factor.
read.table(text = x, as.is = TRUE)
## V1 V2
## 1 USDZAR Curncy
## 2 R157 Govt
## 3 SPX Index
It's pretty easy with stringr:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
library(stringr)
str_split_fixed(x, " ", n = 2)[, 1]
If you're like me, in that regexp's will always remain an inscrutable, frustrating mystery, this clunkier solution also exists:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
unlist(lapply(strsplit(x," ",fixed=TRUE),"[",1))
The fixed=TRUE isn't strictly necessary, just pointing out that you can do this (simple case) w/out really knowing the first thing about regexp's.
Edited to reflect #Wojciech's comment.
The regex would be to search for:
\x20.*
and replace with an empty string.
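In R that would be something like the following (the \x20 in the string literal is just the space character, so this is equivalent to the sub() answer above):

```r
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
# Replace the first space and everything after it with an empty string
sub("\x20.*", "", x)
# [1] "USDZAR" "R157"   "SPX"
```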
If you want to know whether it's faster, just time it.
I want to remove from a string all characters that are not digits, minus signs, or decimal points.
I imported data from Excel using read.xls, which include some strange characters. I need to convert these to numeric. I am not too familiar with regular expressions, so need a simpler way to do the following:
excel_coords <- c(" 19.53380Ý°", " 20.02591°", "-155.91059°", "-155.8154°")
unwanted <- unique(unlist(strsplit(gsub("[0-9]|\\.|-", "", excel_coords), "")))
clean_coords <- gsub(do.call("paste", args = c(as.list(unwanted), sep="|")),
replacement = "", x = excel_coords)
> clean_coords
[1] "19.53380" "20.02591" "-155.91059" "-155.8154"
Bonus if somebody can tell me why these characters have appeared in some of my data (the degree signs are part of the original Excel worksheet, but the others are not).
Short and sweet. Thanks to comment by G. Grothendieck.
gsub("[^-.0-9]", "", excel_coords)
From http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html: "A character class is a list of characters enclosed between [ and ] which matches any single character in that list; unless the first character of the list is the caret ^, when it matches any character not in the list."
Can also be done by using strsplit, sapply and paste and by indexing the correct characters rather than the wrong ones:
excel_coords <- c(" 19.53380Ý°", " 20.02591°", "-155.91059°", "-155.8154°")
correct_chars <- c(0:9,"-",".")
sapply(strsplit(excel_coords,""),
function(x)paste(x[x%in%correct_chars],collapse=""))
[1] "19.53380" "20.02591" "-155.91059" "-155.8154"
gsub("(.+)([[:digit:]]+\\.[[:digit:]]+)(.+)", "\\2", excel_coords)
[1] "9.53380"   "0.02591"   "5.91059"   "5.8154"
Note that the greedy leading (.+) also consumes the first digits and the minus signs, so this does not reproduce the desired output.