Confused with the locale settings in R

Confused with the locale settings in R - regex

I just answered the question "Removing characters after a EURO symbol in R", but my code does not work on my machine, although the same R code works for others who are on Ubuntu.
This is my code.
x <- "services as defined in this SOW at a price of € 15,896.80 (if executed fro"
euro <- "\u20AC"
gsub(paste(euro , "(\\S+)|."), "\\1", x)
# ""
I think this comes down to the locale settings, but I don't know how to change them.
I'm running RStudio on Windows 8.
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
loaded via a namespace (and not attached):
[1] tools_3.2.0
@Anada's answer is good, but we would need to add that encoding parameter every time we use Unicode characters in a regex. Is there any way to change the default encoding to UTF-8 on Windows?

Seems to be a problem with encoding.
Consider:
x <- "services as defined in this SOW at a price of € 15,896.80 (if executed fro"
gsub(paste(euro , "(\\S+)|."), "\\1", x)
# [1] ""
gsub(paste(euro , "(\\S+)|."), "\\1", `Encoding<-`(x, "UTF8"))
# [1] "15,896.80"
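A minimal sketch of the same idea, assuming the problem really is how the string is encoded rather than the regex itself. Both the pattern and the input are built from \u escapes, so R marks them as UTF-8 no matter how the script file or console is encoded, and enc2utf8() (base R) converts the input explicitly instead of only re-declaring it. I have not verified this on Windows 8, so treat it as something to try:

```r
# Build the inputs from \u escapes so the strings are marked as UTF-8
# regardless of the encoding of the script file or console.
euro <- "\u20AC"
x <- "services as defined in this SOW at a price of \u20AC 15,896.80 (if executed fro"

# Show how R has marked the string ("unknown", "latin1" or "UTF-8").
print(Encoding(x))

# Convert to UTF-8; this is a no-op if the string already is UTF-8.
x_utf8 <- enc2utf8(x)

# Same pattern as in the question: keep what follows the euro sign,
# drop every other character.
result <- gsub(paste(euro, "(\\S+)|."), "\\1", x_utf8)
print(result)
```

Calling Encoding(x) before and after the conversion is a quick way to see whether the string was marked the way you expected.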

Related

rhandsontable does not display anything

I am using the rhandsontable library in R to create a table in my R Markdown file. Just to test, I tried the example code from the package:
library(rhandsontable)
DF = data.frame(val = 1:10, bool = TRUE, big = LETTERS[1:10],
                small = letters[1:10],
                dt = seq(from = Sys.Date(), by = "days", length.out = 10),
                stringsAsFactors = FALSE)
rhandsontable(DF) %>%
  hot_cols(columnSorting = TRUE)
However, this does not print anything in the Viewer tab of RStudio, and I don't get any error message either. Inserting the rhandsontable code into a chunk in the R Markdown document has the same result: I get a blank. I am not sure what is missing, so any help would be very useful. My system info is as follows:
> R.Version()
$platform
[1] "x86_64-w64-mingw32"
$arch
[1] "x86_64"
$os
[1] "mingw32"
$system
[1] "x86_64, mingw32"
$status
[1] ""
$major
[1] "3"
$minor
[1] "4.1"
$year
[1] "2017"
$month
[1] "06"
$day
[1] "30"
$`svn rev`
[1] "72865"
$language
[1] "R"
$version.string
[1] "R version 3.4.1 (2017-06-30)"
$nickname
[1] "Single Candle"
and the rhandsontable version is 0.3.5.

It seems the issue is resolved when I install the GitHub version of the package:
devtools::install_github("jrowen/rhandsontable", dependencies = T, upgrade_dependencies = T)

Find all words that have "<-" at the end of the word OR in front of a dot

How do I pull out all words that have the symbol "<-" either at the end of the word or somewhere in the middle, but in the latter case only if the "<-" symbol is followed by a dot?
To put it into context: Exercise 6.5.3 a. of Hadley Wickham's Advanced R asks the reader to list all replacement functions in the base package.
Replacement functions that have only one method are indicated by the symbol <- right at the end of the function name. Generic functions, however, have their method name attached to the name of the replacement form (with a dot), such that the <- is no longer at the end of the function name. Example: split<-.data.frame
EDIT:
objs <- mget(ls("package:base"), inherits = TRUE)
funs <- Filter(is.function, objs)
This is how you pull out all functions in the base package. Now I want to find only the replacement functions.

If you want all base package replacement functions and their respective S3 methods, you can try
ls(envir = as.environment("package:base"), pattern = "<-")
With no packages loaded, this gives the following result:
[1] "<<-" "<-" "[<-"
[4] "[[<-" "@<-" "$<-"
[7] "attr<-" "attributes<-" "body<-"
[10] "class<-" "colnames<-" "comment<-"
[13] "[<-.data.frame" "[[<-.data.frame" "$<-.data.frame"
[16] "[<-.Date" "diag<-" "dim<-"
[19] "dimnames<-" "dimnames<-.data.frame" "Encoding<-"
[22] "environment<-" "[<-.factor" "[[<-.factor"
[25] "formals<-" "is.na<-" "is.na<-.default"
[28] "is.na<-.factor" "is.na<-.numeric_version" "length<-"
[31] "length<-.factor" "levels<-" "levels<-.factor"
[34] "mode<-" "mostattributes<-" "names<-"
[37] "names<-.POSIXlt" "[<-.numeric_version" "[[<-.numeric_version"
[40] "oldClass<-" "parent.env<-" "[<-.POSIXct"
[43] "[<-.POSIXlt" "regmatches<-" "row.names<-"
[46] "rownames<-" "row.names<-.data.frame" "row.names<-.default"
[49] "split<-" "split<-.data.frame" "split<-.default"
[52] "storage.mode<-" "substr<-" "substring<-"
[55] "units<-" "units<-.difftime"
Thanks to @42 for helping me improve this answer.
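Note that pattern = "<-" matches the symbol anywhere in the name. A rough sketch of a stricter filter, following the rule from the question ("<-" at the end of the name, or directly followed by a dot), could look like the following. The variable names are mine, and the regex is only one possible reading of that rule; it still keeps the assignment operators "<-" and "<<-" themselves, because they also end in "<-".

```r
# All visible names in the base package.
base_names <- ls("package:base")

# Keep only the names that are bound to functions.
is_fun <- vapply(
  base_names,
  function(nm) is.function(get(nm, envir = baseenv())),
  logical(1)
)

# "<-" at the very end of the name, or "<-" immediately followed by a dot.
replacement_funs <- grep("<-$|<-\\.", base_names[is_fun], value = TRUE)

print(head(replacement_funs, 10))
```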

We can try
library(stringr)
str_extract(v1, "\\w+<-$|\\w*<-\\.\\S+")
#[1] "split<-.data.frame" NA "splitdata<-"
data
v1 <- c("split<-.data.frame", "split<-data", "splitdata<-")

Extracting pattern from the nested list in R using regex

I have the following sorted list (lst) of time periods. I want to split the periods into specific dates and then extract the maximum time period without altering the order of the list.
$`1`
[1] "01.12.2015 - 21.12.2015"
$`2`
[1] "22.12.2015 - 05.01.2016"
$`3`
[1] "14.09.2015 - 12.10.2015" "29.09.2015 - 26.10.2015"
After the adjustment, the list should look like this:
$`1`
[1] "01.12.2015" "21.12.2015"
$`2`
[1] "22.12.2015" "05.01.2016"
$`3`
[1] "14.09.2015" "12.10.2015" "29.09.2015" "26.10.2015"
In order to do so, I began with splitting the list:
lst_split <- str_split(lst, pattern = " - ")
which leads to the following:
[[1]]
[1] "01.12.2015" "21.12.2015"
[[2]]
[1] "22.12.2015" "05.01.2016"
[[3]]
[1] "c(\"14.09.2015" "12.10.2015\", \"29.09.2015" "26.10.2015\")"
Then, I tried to extract the pattern:
lapply(lst_split, function(x) str_extract(pattern = c("\\d+\\.\\d+\\.\\d+"),x))
but my output is missing one date (29.09.2015)
[[1]]
[1] "01.12.2015" "21.12.2015"
[[2]]
[1] "22.12.2015" "05.01.2016"
[[3]]
[1] "14.09.2015" "12.10.2015" "26.10.2015"
Does anyone have an idea how I could make it work, and maybe propose a more efficient solution? Thank you in advance.

Thanks to the comments of @WiktorStribiżew and @akrun: it is enough to use str_extract_all.
In this example:
> str_extract_all(lst,"\\d+\\.\\d+\\.\\d+")
[[1]]
[1] "01.12.2015" "21.12.2015"
[[2]]
[1] "22.12.2015" "05.01.2016"
[[3]]
[1] "14.09.2015" "12.10.2015" "29.09.2015" "26.10.2015"
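The date went missing in the question because str_extract returns only the first match per string, and after str_split one of the pieces contained two dates. For anyone who wants to avoid the package, here is a base R sketch of the same extraction using gregexpr and regmatches. It assumes lst is the list from the question and works on each element directly, so the list names are kept:

```r
# The list from the question, rebuilt so the example is self-contained.
lst <- list(
  `1` = "01.12.2015 - 21.12.2015",
  `2` = "22.12.2015 - 05.01.2016",
  `3` = c("14.09.2015 - 12.10.2015", "29.09.2015 - 26.10.2015")
)

# gregexpr finds every match in each string; regmatches pulls them out.
# unlist flattens the per-string results into one vector per list element.
dates <- lapply(lst, function(x) {
  unlist(regmatches(x, gregexpr("\\d+\\.\\d+\\.\\d+", x)))
})

print(dates)
```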

1) Use strsplit, flatten each component using unlist, convert the dates to "Date" class and then use range to get the maximum time span. No packages are used.
> lapply(lst, function(x) range(as.Date(unlist(strsplit(x, " - ")), "%d.%m.%Y")))
$`1`
[1] "2015-12-01" "2015-12-21"
$`2`
[1] "2015-12-22" "2016-01-05"
$`3`
[1] "2015-09-14" "2015-10-26"
2) This variation using a magrittr pipeline also works:
library(magrittr)
lapply(lst, function(x)
  x %>%
    strsplit(" - ") %>%
    unlist %>%
    as.Date("%d.%m.%Y") %>%
    range
)
Note: The input lst in reproducible form is:
lst <- structure(list(`1` = "01.12.2015 - 21.12.2015", `2` = "22.12.2015 - 05.01.2016",
`3` = c("14.09.2015 - 12.10.2015", "29.09.2015 - 26.10.2015"
)), .Names = c("1", "2", "3"))

R: need to replace invisible/accented characters with regex

I'm working with a file generated on several different machines that had different locale settings, so I ended up with a data frame column that has different spellings of the same word:
CÃ“RDOBA
CÓRDOBA
CÒRDOBA
I'd like to convert all those to CORDOBA. I've tried doing
t<-gsub("Ã“|Ó|Ã’|Â°|°|Ò","O",t,ignore.case = T) # t is the vector of names
Which works until it finds some "invisible" characters:
As you can see, I'm not able to see, in R, the additional character that lies between Ã and \ (if I copy-paste into MS Word, Word shows it as an empty rectangle). I've tried to dput the vector, but it shows exactly as on screen (without the "invisible" character).
I ran Encoding(t), and it returns unknown for all values.
My system configuration follows:
> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)
locale:
[1] LC_COLLATE=Spanish_Colombia.1252 LC_CTYPE=Spanish_Colombia.1252 LC_MONETARY=Spanish_Colombia.1252 LC_NUMERIC=C
[5] LC_TIME=Spanish_Colombia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] zoo_1.7-12 dplyr_0.4.2 data.table_1.9.4
loaded via a namespace (and not attached):
[1] R6_2.1.0 assertthat_0.1 magrittr_1.5 plyr_1.8.3 parallel_3.2.1 DBI_0.3.1 tools_3.2.1 reshape2_1.4.1 Rcpp_0.11.6 stringi_0.5-5
[11] grid_3.2.1 stringr_1.0.0 chron_2.3-47 lattice_0.20-31
I've used saveRDS to save a file with a data frame of actual and expected toy values, which can be loaded with readRDS from here. I'm not absolutely sure it will load with the same problems I have (depending on your locale), but I hope it does, so you can provide some help.
In the end, I'd like to convert all those special characters to unaccented ones (Ó to O, etc.), hopefully without having to manually enter each special character into a regex. In other words, I'd like, if possible, some sort of gsub("[:weird:]","[:equivalentToWeird:]",t). If that is not possible, I'd at least like to be able to find (and replace) those "invisible" characters.
Thanks,
############## EDIT TO ADD ###################
If I run the following code:
d<-readRDS("c:/path/to(downloaded/Dropbox/file/inv_char.Rdata")
stri_escape_unicode(d$actual)
This is what I get:
[1] "\\u00c3\\u201cN N\\u00c2\\u00b0 08 \\\"CACIQUE CALARC\\u00c3\\u0081\\\" - ARMENIA"
[2] "\\u00d3N N\\u00b0 08 \\\"CACIQUE CALARC\\u00c1\\\" - ARMENIA"
[3] "\\u00d3N N\\u00b0 08 \\\"CACIQUE CALARC\\u00c1\\\" - ARMENIA(ALTERNO)"
Normal output is:
> d$actual
[1] Ã“N NÂ° 08 "CACIQUE CALARCÃ" - ARMENIA ÓN N° 08 "CACIQUE CALARCÁ" - ARMENIA ÓN N° 08 "CACIQUE CALARCÁ" - ARMENIA(ALTERNO)

With the help of @hadley, who pointed me towards stringi, I ended up discovering the offending characters and replacing them. This was my initial attempt:
unweird<-function(t){
  # Work on the escaped form, where every non-ASCII character is a literal \uXXXX
  t<-stri_escape_unicode(t)
  t<-gsub("\\\\u00c3\\\\u0081|\\\\u00c1","A",t)
  t<-gsub("\\\\u00c3\\\\u02c6|\\\\u00c3\\\\u2030|\\\\u00c9|\\\\u00c8","E",t)
  t<-gsub("\\\\u00c3\\\\u0152|\\\\u00c3\\\\u008d|\\\\u00cd|\\\\u00cc","I",t)
  t<-gsub("\\\\u00c3\\\\u2019|\\\\u00c3\\\\u201c|\\\\u00c2\\\\u00b0|\\\\u00d3|\\\\u00b0|\\\\u00d2|\\\\u00ba|\\\\u00c2\\\\u00ba","O",t)
  t<-gsub("\\\\u00c3\\\\u2018|\\\\u00d1","N",t)
  # Four backslashes, as in the other patterns, to match the literal backslash
  t<-gsub("\\\\u00a0|\\\\u00c2\\\\u00a0","",t)
  t<-gsub("\\\\u00f3","o",t)
  # Return the unescaped result visibly
  stri_unescape_unicode(t)
}
which produced the expected result. I was a little curious about other stringi functions, so I wondered whether its substitution function could be faster on my 3.3 million rows. I then tried stri_replace_all_regex like this:
stri_unweird<-function(t){
  stri_unescape_unicode(stri_replace_all_regex(stri_escape_unicode(t),
    c("\\\\u00c3\\\\u0081|\\\\u00c1",
      "\\\\u00c3\\\\u02c6|\\\\u00c3\\\\u2030|\\\\u00c9|\\\\u00c8",
      "\\\\u00c3\\\\u0152|\\\\u00c3\\\\u008d|\\\\u00cd|\\\\u00cc",
      "\\\\u00c3\\\\u2019|\\\\u00c3\\\\u201c|\\\\u00c2\\\\u00b0|\\\\u00d3|\\\\u00b0|\\\\u00d2|\\\\u00ba|\\\\u00c2\\\\u00ba",
      "\\\\u00c3\\\\u2018|\\\\u00d1",
      # Four backslashes, as in the other patterns, to match the literal backslash
      "\\\\u00a0|\\\\u00c2\\\\u00a0",
      "\\\\u00f3"),
    c("A","E","I","O","N","","o"),
    vectorize_all = FALSE))
}
As a side note, I ran microbenchmark on both methods. These are the results:
g<-microbenchmark(unweird(t),stri_unweird(t),times = 100L)
summary(g)
             expr      min       lq     mean   median       uq      max neval cld
1      unweird(t) 423.0083 425.6400 431.9609 428.1031 432.6295 490.7658   100   b
2 stri_unweird(t) 118.5831 119.5057 121.2378 120.3550 121.8602 138.3111   100  a
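For the "invisible" character itself: judging from the stri_escape_unicode output in the question, the character after Ã is U+0081, a C1 control character with no glyph, which would explain why it shows as nothing in R and as an empty rectangle in Word. A small base R sketch for spotting such characters without stringi follows. The sample string is rebuilt from the escapes shown in the question, and the variable names are mine:

```r
# "CALARC" followed by U+00C3 and the invisible U+0081, as in the question's
# escaped output.
garbled <- "CALARC\u00c3\u0081"

# utf8ToInt gives one Unicode code point per character.
code_points <- utf8ToInt(garbled)

# Print them in the usual U+XXXX notation.
labels <- sprintf("U+%04X", code_points)
print(labels)

# Positions of C1 control characters (U+0080 to U+009F), which have no glyph.
invisible_idx <- which(code_points >= 0x80 & code_points <= 0x9F)
print(invisible_idx)
```

Once the code points are known, they can go into a pattern as \u escapes, much like the functions above do with the escaped form.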

Parsing out a line in R to pick different objects

I have this line:
system<-c("System configuration: type=Shared mode=Uncapped smt=4 lcpu=96 mem=393216MB psize=64 ent=16.00")
I need to parse this line and pick smt, lcpu, mem, mpsize and ent into different objects.
For example, I am doing this to pick smt, but it picks the whole line. Any ideas what I am doing wrong here?
smt<-sub('^.* smt=([[:digit:]])', '\\1', system)
smt needs to have a number 4 in this case.

I would use strsplit a couple times, and type.convert:
parse.config <- function(x) {
  clean <- sub("System configuration: ", "", x)
  pairs <- strsplit(clean, " ")[[1]]
  items <- strsplit(pairs, "=")
  keys <- sapply(items, `[`, 1)
  values <- sapply(items, `[`, 2)
  values <- lapply(values, type.convert, as.is = TRUE)
  setNames(values, keys)
}
config <- parse.config(system)
# $type
# [1] "Shared"
#
# $mode
# [1] "Uncapped"
#
# $smt
# [1] 4
#
# $lcpu
# [1] 96
#
# $mem
# [1] "393216MB"
#
# $psize
# [1] 64
#
# $ent
# [1] 16
The output is a list so you can access any of the parsed items, for example:
config$smt
# [1] 4

Using strapplyc in the gsubfn package, the following creates a list L whose names are the left-hand sides, such as smt, and whose values are the right-hand sides.
library(gsubfn)
LHS <- strapplyc( system, "(\\w+)=" )[[1]]
RHS <- strapplyc( system, "=(\\w+)" )[[1]]
L <- setNames( as.list(RHS), LHS )
For example we can now get smt like this (and similarly for the other left hand sides):
> L$smt
[1] "4"
UPDATE: Simplified.

Add .* to the end of your matching expression and you'll get "4".
sub('^.* smt=([[:digit:]]+).*', '\\1', system)
You may also want to keep the + I added after [[:digit:]], in case the value has more than a single digit.
You could also approach this by splitting on spaces and then finding the matches:
splits <- unlist(strsplit(system, ' '))
sub('smt=', '', grep('smt=', splits, value=TRUE))
# [1] "4"
or wrapping it in a function:
matchfun <- function(string, to_match, splitter=' ') {
  splits <- unlist(strsplit(string, splitter))
  sub(to_match, '', grep(to_match, splits, value=TRUE))
}
matchfun(system, 'smt=')
# [1] "4"
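To see why the trailing .* matters: sub only replaces the part of the string that the pattern actually matched, so anything the pattern does not consume is left in the result. A small sketch comparing the two patterns (the variable is named config_line here rather than system, which is my own choice):

```r
config_line <- "System configuration: type=Shared mode=Uncapped smt=4 lcpu=96 mem=393216MB psize=64 ent=16.00"

# Without a trailing .*, only the text up to and including the digits is
# replaced; everything after it is left in place.
partial <- sub("^.* smt=([[:digit:]]+)", "\\1", config_line)

# With the trailing .*, the pattern consumes the whole line, so only the
# captured digits remain.
full <- sub("^.* smt=([[:digit:]]+).*", "\\1", config_line)

print(partial)
print(full)
```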

Well, I'm voting for @GaborGrothendieck's, but am offering this as a more pedestrian alternative:
inp <- c("System configuration: type=Shared mode=Uncapped smt=4 lcpu=96 mem=393216MB psize=64 ent=16.00")
inparsed <- read.table(text=inp, stringsAsFactors=FALSE)
vals <- unlist(inparsed)[grep("\\=", unlist(inparsed))]
vals
# V3 V4 V5 V6 V7 V8 V9
# type=Shared mode=Uncapped smt=4 lcpu=96 mem=393216MB psize=64 ent=16.00
vals[grep("smt|lcpu|mem|mpsize|ent", vals)]
V5 V6 V7 V9
"smt=4" "lcpu=96" "mem=393216MB" "ent=16.00"
I would note that choosing the name 'system' for a variable seems most unwise in light of the system function's existence.
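Along the same lines, here is a base R sketch that pulls every key=value pair with one regex and keeps the values as strings until you decide how to convert them. The names config_line and cfg are my own, and using \\S+ for the value is a choice on my part so that a value such as 16.00 is not cut off at the dot:

```r
config_line <- "System configuration: type=Shared mode=Uncapped smt=4 lcpu=96 mem=393216MB psize=64 ent=16.00"

# Every run of word characters, an equals sign, then non-space characters.
pairs <- regmatches(config_line, gregexpr("\\w+=\\S+", config_line))[[1]]

# Split each pair at its first "=".
keys <- sub("=.*", "", pairs)
values <- sub("^[^=]*=", "", pairs)

# Named list of raw (character) values.
cfg <- setNames(as.list(values), keys)

# Convert only the items you need, with the type you expect.
smt <- as.integer(cfg$smt)
lcpu <- as.integer(cfg$lcpu)
ent <- as.numeric(cfg$ent)

print(cfg$mem)
print(c(smt = smt, lcpu = lcpu, ent = ent))
```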
