Concatenate gsub [duplicate]

This question already has answers here:
Replace multiple letters with accents with gsub
(11 answers)
Closed 9 years ago.
I'm currently running the following code to clean my data from accent characters:
df <- gsub('Á|Ã', 'A', df)
df <- gsub('É|Ê', 'E', df)
df <- gsub('Í', 'I', df)
df <- gsub('Ó|Õ', 'O', df)
df <- gsub('Ú', 'U', df)
df <- gsub('Ç', 'C', df)
However, I would like to do it in just one line (using another function for it would be ok). How can I do this?

Try something like this
iconv(c('Á'), "utf8", "ASCII//TRANSLIT")
You can just add more elements to the c().
EDIT: it is machine dependent; check help(iconv)
Here is the R solution
mychar <- c('ÁÃÉÊÍÓÕÚÇ')
iconv(mychar, "latin1", "ASCII//TRANSLIT") # one line, as requested
[1] "AAEEIOOUC"

It's an encoding problem; normally you resolve it by specifying the right encoding. If you still want to use regular expressions, you can use gsubfn to write a one-line solution:
library(gsubfn)
ll <- list('Á'='A', 'Ã'='A', 'É'='E',
'Ê'='E', 'Í'='I', 'Ó'='O',
'Õ'='O', 'Ú'='U', 'Ç'='C')
gsubfn('Á|Ã|É|Ê|Í|Ó|Õ|Ú|Ç',ll,'ÁÃÉÊÍÓÕÚÇ')
[1] "AAEEIOOUC"
gsubfn('Á|Ã|É|Ê|Í|Ó|Õ|Ú|Ç',ll,c('ÁÃÉÊÍÓÕÚÇ','ÍÓÕÚÇ'))
[1] "AAEEIOOUC" "IOOUC"

One option could be chartr
> toreplace <- LETTERS
> replacewith <- letters
> (somestring <- paste(sample(LETTERS,10),collapse=""))
[1] "MUXJVYNZQH"
>
> chartr(
+ old=paste(toreplace,collapse=""),
+ new=paste(replacewith,collapse=""),
+ x=somestring
+ )
[1] "muxjvynzqh"

df = as.data.frame(apply(df, 2, function(x) gsub('Á|Ã', 'A', x)))
The 2 indicates columns; 1 would indicate rows.

Alternative for matching specific characters in regex [duplicate]

This question already has answers here:
Optimize long lists of fixed string alternatives in regex
(2 answers)
Closed 6 months ago.
I have made a regex for matching the specific letters:
a, ae, eo, e, eu, ya, yae, yeo, ye, yo, o, oe, wa, wae, wo, we, wi, yu, u, ui, i, oo, ah
This is the solution I came up with: a[eh]||e[ou]|o[eo]|u[i]|w[aoei]|y[aeou]|[aeiou]. Is there an alternative that would improve its performance, or a better solution overall?
I don't think there is a fundamentally different regex for this type of pattern, but you can build the regex dynamically, like this:
arr = ['a', 'ae', 'eo', 'e', 'eu', 'ya', 'yae', 'yeo', 'ye', 'yo', 'o', 'oe', 'wa', 'wae', 'wo', 'we', 'wi', 'yu', 'u', 'ui', 'i', 'oo', 'ah'];
s = ''
arr.forEach(x => s = s + (s ? '|' : '') + x)
reg = new RegExp(s, 'gi')
// /a|ae|eo|e|eu|ya|yae|yeo|ye|yo|o|oe|wa|wae|wo|we|wi|yu|u|ui|i|oo|ah/gi
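Since the rest of this page is R, here is the same dynamic-building idea there; sorting the alternatives by decreasing length first ensures longer alternatives such as 'yae' are tried before their prefixes such as 'ya' (a sketch, relying on the left-to-right alternation order of PCRE/ICU-style engines):
arr <- c('a','ae','eo','e','eu','ya','yae','yeo','ye','yo','o','oe',
         'wa','wae','wo','we','wi','yu','u','ui','i','oo','ah')
pattern <- paste(arr[order(-nchar(arr))], collapse = "|")  # longest alternatives first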

How to modify string in R taking into account the number of symbols you want to modify [duplicate]

This question already has answers here:
How to add leading zeros?
(8 answers)
Closed 6 years ago.
This question is very easy to understand, but I can't wrap my head around how to get a solution. Let's say I have a vector and I want to modify it so it would have 5 integers at the end, and missing digits are replaced with zeros:
Smth1      ->  Smth00001
Smth22     ->  Smth00022
Smth333    ->  Smth00333
Smth4444   ->  Smth04444
Smth55555  ->  Smth55555
I guess it can be done with regex and functions like gsub, but I don't understand how to take into account the length of the replaced string.
Here's an idea using stringi:
v <- c("Smth1", "Smth22", "Smth333", "Smth4444", "Smth55555")
library(stringi)
d <- stri_extract(v, regex = "[:digit:]+")
a <- stri_extract(v, regex = "[:alpha:]+")
paste0(a, stri_pad_left(d, 5, "0"))
Which gives:
[1] "Smth00022" "Smth00333" "Smth04444" "Smth55555"
Using base R. Someone else can prettify the regex:
sprintf("%s%05d", gsub("^([^0-9]+)..*$", "\\1", x),
as.numeric(gsub("^..*[^0-9]([0-9]+)$", "\\1", x)))
[1] "Smth00001" "Smth00022" "Smth00333" "Smth04444" "Smth55555"
Here is a simple one-line solution, similar to Zelazny's, but using a replacement callback inside gsubfn from the gsubfn library:
> library(gsubfn)
> v <- c("Smth1", "Smth22", "Smth333", "Smth4444", "Smth55555")
> gsubfn('[0-9]+$', ~ sprintf("%05d",as.numeric(x)), v)
[1] "Smth00001" "Smth00022" "Smth00333" "Smth04444" "Smth55555"
The regex [0-9]+$ matches one or more digits, but only at the end of the string due to the $ anchor. The matched digits are passed to the callback (~), and sprintf("%05d", as.numeric(x)) pads the number (parsed with as.numeric) with leading zeros.
To only modify strings that have 1+ non-digit symbols at the start and then 1 or more digits up to the end, just use this PCRE-based gsubfn:
> gsubfn('^[^0-9]+\\K([0-9]+)$', ~ sprintf("%05d",as.numeric(x)), v, perl=TRUE)
[1] "Smth00001" "Smth00022" "Smth00333" "Smth04444" "Smth55555"
where
^ - start of string
[^0-9]+\\K - matches 1+ non-digit symbols and \K will omit them
([0-9]+) - Group 1 passed to the callback
$ - end of string.
Here is a solution using the stringr library:
library(stringr)
library(dplyr)
num <- str_extract(v, "[1-9]+")
padding <- 9 - nchar(num)
output <- paste0(str_extract(v, "[^0-9]+") %>%
  str_pad(width = padding, side = c("right"), pad = "0"), num)
The output is:
"Smth00001" "Smth00022" "Smth00333" "Smth04444" "Smth55555"
library(stringr)
paste0(str_extract(v,'\\D+'),str_pad(str_extract(v,'\\d+'),5,'left', '0'))
#[1] "Smth00001" "Smth00022" "Smth00333" "Smth04444" "Smth55555"

Replace element in string after first occurrence

I wish to replace all 2's in a string after the first occurrence of a 2, ideally using regex in base R. This seems like it must be a duplicate, but I cannot locate the answer.
Here is an example:
my.data <- read.table(text='
my.string
.1.222.2.2
..1..1..2.
1.1.2.2...
.222.232..
..1..1....
', header=TRUE, stringsAsFactors = FALSE)
my.data
desired.result <- read.table(text='
my.string
.1.2......
..1..1..2.
1.1.2.....
.2....3...
..1..1....
', header=TRUE, stringsAsFactors = FALSE)
desired.result
my.last.2 <- c(4, 9, 5, 2, NA)
my.last.2
Thank you for any assistance.
This appears to match your desired output:
> gsub(pattern = "(?<=2)(.*?)2",
+      replacement = "\\1\\.",
+      x = my.data$my.string,
+      perl = TRUE)
[1] ".1.2......" "..1..1..2." "1.1.2....." ".2....3..." "..1..1...."
This is literally a direct modification of this answer to a very similar question, adapted to make it R specific. I'll be honest, I don't quite understand this regex, so use (and up-vote) with caution.
This works, but is probably inefficient:
with(my.data, gsub("#", "2", gsub("2", ".", sub("2", "#", my.string))))
# [1] ".1.2......" "..1..1..2." "1.1.2....." ".2....3..." "..1..1...."
Approach: use sub to match only the first occurrence and change it to # (or some other placeholder character which doesn't show up elsewhere in my.string), then use gsub to replace all remaining 2s, then gsub the # back into a 2.
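If the regex feels too clever, the same result can be had without one by locating the first 2 and substituting only in the remainder of the string; a sketch (the helper replace_after_first_2 is hypothetical, not from the original answers):
replace_after_first_2 <- function(s) {
  i <- regexpr("2", s, fixed = TRUE)   # position of the first 2, -1 if absent
  if (i == -1) return(s)
  paste0(substring(s, 1, i),
         gsub("2", ".", substring(s, i + 1), fixed = TRUE))
}
vapply(my.data$my.string, replace_after_first_2, character(1), USE.NAMES = FALSE)
# [1] ".1.2......" "..1..1..2." "1.1.2....." ".2....3..." "..1..1...."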

Split or substitute strings with wildcards in R [duplicate]

This question already has answers here:
Split data frame string column into multiple columns
(16 answers)
Closed 6 years ago.
I have the following vector:
a <- c("abc_lvl1", "def_lvl2")
I basically want to split into two vectors:
("abc", "def") and ("lvl1", "lvl2). I know how to substitute with sub:
sub(".*_", "", a)
[1] "lvl1" "lvl2"
I think this translates into "Search for any number of any characters before '_' and replace with nothing." Accordingly - I thought - this should give me the other desired vector:
sub("_*.", "", a), but it removes just the leading character:
[1] "bc_lvl1" "ef_lvl2"
Where do I mess up?
This is essentially the equivalent of the "text-to-columns" function in Excel.
There are several ways to do this. Here are a few, some using packages, and others with base R.
Given:
a <- c("abc_lvl1", "def_lvl2")
Here are some options:
do.call(rbind, strsplit(a, "_", TRUE))
matrix(scan(what = "", text = a, sep = "_"), ncol = 2, byrow = TRUE)
scan(text = a, sep = "_", what = list("", "")) ## a list
library(splitstackshape)
cSplit(data.table(a), "a", "_")
library(data.table)
setDT(tstrsplit(a, "_"))[]
library(dplyr)
library(tidyr)
data_frame(a) %>%
separate(a, into = c("this", "that"))
library(reshape2)
colsplit(a, "_", c("this", "that"))
library(stringi)
t(stri_split_fixed(a, "_", simplify = TRUE))
library(iotools)
mstrsplit(a, "_") # Matrix
dstrsplit(a, col_types = c("character", "character"), "_") # data.frame
library(gsubfn)
read.pattern(text = a, pattern = "(.*)_(.*)")
We can use read.csv/read.table and specify the sep="_". It will split the strings into two columns.
read.csv(text=a, sep="_", header=FALSE)
Just to build on the initial comments
a <- c("abc_lvl1", "def_lvl2")
a1 <- do.call(c, lapply(a, function(x){strsplit(x, "_")[[1]][1]}))
a2 <- do.call(c, lapply(a, function(x){strsplit(x, "_")[[1]][2]}))
a1
[1] "abc" "def"
a2
[1] "lvl1" "lvl2"

Perform multiple search-and-replaces on the colnames of a dataframe

I have a dataframe with 95 cols and want to batch-rename a lot of them with simple regexes, like the snippet at the bottom; there are ~30 such lines. Any other columns which don't match the search regex must be left untouched.
Example: names(tr) = c('foo', 'bar', 'xxx_14', 'xxx_2001', 'yyy_76', 'baz', 'zzz_22', ...)
I started out with a wall of 25 gsub()s - crude but effective:
names(tr) <- gsub('_1$', '_R', names(tr))
names(tr) <- gsub('_14$', '_I', names(tr))
names(tr) <- gsub('_22$', '_P', names(tr))
names(tr) <- gsub('_50$', '_O', names(tr))
... yada yada
@Joshua: mapply doesn't work; it turns out it's more complicated and impossible to vectorize. names(tr) contains other columns, and when these patterns do occur, you cannot assume all of them occur, let alone in the exact order we defined them. Hence, try 2 is:
pattern <- paste('_', c('1','14','22','50','52','57','76','1018','2001','3301','6005'), '$', sep='')
replace <- paste('_', c('R','I', 'P', 'O', 'C', 'D', 'M', 'L', 'S', 'K', 'G'), sep='')
do.call(gsub, list(pattern, replace, names(tr)))
Warning messages:
1: In function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE, :
argument 'pattern' has length > 1 and only the first element will be used
2: In function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE, :
argument 'replacement' has length > 1 and only the first element will be used
Can anyone fix this for me?
EDIT: I read all around SO and the R docs on this subject for over a day and couldn't find anything... then as soon as I posted I thought of searching for '[r] translation table' and found xlate, which is not mentioned anywhere in the grep/sub/gsub documentation.
Is there anything in base/gsubfn/data.table etc. to allow me to write one search-and-replacement instruction? (like a dictionary or translation table)
Can you improve my clunky syntax to be call-by-reference to tr? (mustn't create temp copy of entire df)
EDIT2: my best effort after reading around was:
The dictionary approach (xlate) might be a partial answer, but this is more than a simple translation table, since the regex must be anchored at the end (e.g. '_14$').
I could use gsub() or strsplit() to split on '_' then do my xlate translation on the last component, then paste() them back together. Looking for a cleaner 1/2-line idiom.
Or else I just use walls of gsub()s.
A wall of gsub calls can always be replaced by a for loop, and you can write it as a function:
renamer <- function(x, pattern, replace) {
for (i in seq_along(pattern))
x <- gsub(pattern[i], replace[i], x)
x
}
names(tr) <- renamer(
names(tr),
sprintf('_%s$', c('1','14','22','50','52','57','76','1018','2001','3301','6005')),
sprintf('_%s' , c('R','I', 'P', 'O', 'C', 'D', 'M', 'L', 'S', 'K', 'G'))
)
And I find sprintf more useful than paste for creating this kind of string.
The question predates the boom of the tidyverse, but this is easily solved with the c(pattern1 = replacement1) option in stringr::str_replace_all.
tr <- data.frame("whatevs_1" = NA, "something_52" = NA)
tr
#> whatevs_1 something_52
#> 1 NA NA
patterns <- sprintf('_%s$', c('1','14','22','50','52','57','76','1018','2001','3301','6005'))
replacements <- sprintf('_%s' , c('R','I', 'P', 'O', 'C', 'D', 'M', 'L', 'S', 'K', 'G'))
names(replacements) <- patterns
names(tr) <- stringr::str_replace_all(names(tr), replacements)
tr
#> whatevs_R something_C
#> 1 NA NA
And of course, this particular case can benefit from dplyr
dplyr::rename_all(tr, stringr::str_replace_all, replacements)
#> whatevs_R something_C
#> 1 NA NA
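In newer dplyr versions rename_all() is superseded by rename_with(); a possible equivalent (a sketch, assuming dplyr >= 1.0 and the replacements vector defined above):
dplyr::rename_with(tr, ~ stringr::str_replace_all(.x, replacements))
# same result as the rename_all() call above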
Using do.call() nearly does it, but it objects to the argument lengths. I think I need to nest do.call() inside apply(), like in "apply function to elements over a list".
But I need a partial do.call() over pattern and replace.
This is all starting to make a wall of gsub(..., fixed=TRUE) look like a more efficient idiom, if flabby code.
pattern <- paste('_', c('1','14','22','50'), '$', sep='')
replace <- paste('_', c('R','I', 'P', 'O'), sep='')
do.call(gsub, list(pattern, replace, names(tr)))
Warning messages:
1: In function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE, :
argument 'pattern' has length > 1 and only the first element will be used
2: In function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE, :
argument 'replacement' has length > 1 and only the first element will be used
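For what it's worth, one way to get the "partial do.call" effect without an explicit loop is to fold over the pattern/replace pairs with Reduce; a sketch that is functionally the same as the renamer() loop above (it still makes one gsub pass per pattern):
names(tr) <- Reduce(function(nm, i) gsub(pattern[i], replace[i], nm),
                    seq_along(pattern), init = names(tr))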