R gsub replace several texts at once

R gsub replace several texts at once - regex

In my data frame there is a column with multiplier indicators, such as thousands, hundreds, millions, etc., as text.
I'd like to convert them to numeric. This is what I've tried:
a <- c("Thousands", "thousands", "Hundreds", "hundreds")
newA <- as.numeric(gsub("[Tt]housands","1000",gsub("[Hh]undreds","100",a)))
Which works, but results very cumbersome when there are many multipliers (as is the case). I was thinking there should be a way to do it in one gsub call, but wasn't able to do it. Something like this is what I would like (of course this particular attempt didn't work):
as.numeric(gsub("^.*-","",gsub("([Hh]undreds)([Tt]housands)","\\1-100 \\2-1000",a)))

Try:
library(qdap)
as.numeric(multigsub(c("[Tt]housands", "[Hh]undreds"), c(1000, 100), fixed = FALSE, a))
Or as per suggested by #RichardScriven:
library(stringi)
as.numeric(stri_replace_all_regex(a, c("[Tt]housands", "[Hh]undreds"), c(1000, 100),
vectorize_all = FALSE))

Related

Replacing Value in Vector that are greater than x

I am collecting data from a number of different reports and some are using measurements on a different factor base.
So the vector looks like this:
a <- c("1.5", "1.8", "2.1", "3.4", "4100", "5.1", "6.3", "7.9", "8700", "8.9", "9.0", "11.7")
I'd like to divide the numbers which are greater than 100 by 1000.
Looking at another example I've tried
a[a>100] <- a/1000
But that doesn't seem to store the correct result. Any advice greatfully received.

What language you use for this?
In most dynamical language you can do something like this
a.map({ item -> if (item > 100) item / 1000 else item })
You need to adjust syntax, but generally it'll look like this.

Bin new data according to existing intervals given as factor levels

I have a factor with levels which represent intervals (as produced by cut):
> head(data.train$glucose)
[1] [0,126] [0,126] (126,199] [0,126] [0,126] [0,126]
Levels: [0,126] (126,199]
Now I want to generate a new factor with the same levels from a numeric vector so that when the respective number falls into the first interval (e.g. 24) it becomes [0,126] and if it falls into the second interval (e.g. 153) it becomes (126,199].
The number of intervals can differ as can the form of the brackets (depending on whether they are open or closed intervals).
I think I can use sub together with cut for that (as in the last example in the helpfile of cut) but I am not very good at it to make it general enough. Is there also another way which is a little bit more intuitive? But perhaps I am thinking too complicated anyway...
If you give a solution with sub please explain the expression. Please also do not give solutions with functions from other packages as I am developing a package myself and I want to keep it as lean as possible.

I was looking for an elegant way to do this, but ended up using regex like you suggested:
ints<-cut(1:10,5)
set.seed(345)
a<-runif(20,1,10)
# get levels
levs <- levels(ints)
# remove brackets
levs.num <- sub( "^[\\(\\[]{1}(.+)[\\)\\]]{1}$" , "\\1" ,levs , perl = TRUE)
levs.right <- sub( "^[\\(\\[]{1}.+([\\)\\]]{1})$" , "\\1" ,levs , perl = TRUE)
levs.left <- sub( "^([\\(\\[]{1}).+[\\)\\]]{1}$" , "\\1" ,levs , perl = TRUE)
# get breaks
breaks <- unique(as.numeric(unlist(strsplit(levs.num ,","))))
if(all(levs.right=="]")){
right.arg <- TRUE
}else if(all(levs.left=="[")){
right.arg <- FALSE
}else{
stop("problem")
}
table(cut(a,breaks , right = right.arg ))
My regex should select everything between [ or ( and ] or ) and return it

aregexec matching with two data frames

One is the target data frame (targetframe) and the other dataframe works as a library (word.library) with some key values. Then I need the following algorithm: The algorithm should look up an approximate match between word.library$mainword and targetframe$words. After figuring out the approximate match the substrings in targetframe$words should be replaced with word.library$keyID.
Here are the two data frames mentioned above:
tragetframe <- data.frame(words= c("This is sentence one with the important word",
"This is sentence two with the inportant woord",
"This is sentence three with crazy sayings" ))
word.library <- data.frame(mainword = c("important word",
"crazy sayings"),
keyID = c("1001",
"2001"))
Here is my solution which works.
for(i in 1:nrow(word.library)){
positions <- aregexec(word.library[i,1], tragetframe$words, max.distance = 0.1)
res <- regmatches(tragetframe$words, positions)
res[lengths(res)==0] <- "XXXX" # deal with 0 length matches somehow
tragetframe$words <- Vectorize(gsub)(unlist(res), word.library[i,2], tragetframe$words)
tragetframe$words
}
However: I use a for loop which is not efficent (imagine I have two huge data frames). Has anyone an idea how to resovle this issue more efficiently?

Identify the format of a character column in R

I'm dealing with a huge dataset having 500 columns and a huge number of rows out of which I can take a significantly big sample (e.g. 1 million).
All the columns are in the character format although they can represent different data types:numeric, date, ... I need to build a function that, given a column as an input, recognized its format, taking account of NA values as well.
For instance, given a column col, I recognise if it is numeric in this way.
col <- c(as.character(runif(10000)), rep('NaN', 10))
maxPercNa <- 0.10
nNa <- sum(is.na(as.numeric(col)))
percNa <- nNa / length(col)
isNumeric <- percNa < maxPercNa
In a similar way, I need to recognise dates, integers, ... I was thinking about using regular expressions. A challenge is that the dataset is very big, so the technique should be efficient.
If anyone comes up with a brilliant idea, it'll be really appreciated :) Thanks in advance!

Column of lists inside a dataframe in R

Lets have the following dataframe inside R:
df <- data.frame(sample=rnorm(1,0,1),params=I(list(list(mean=0,sd=1,dist="Normal"))))
df <- rbind(df,data.frame(sample=rgamma(1,5,5),params=I(list(list(shape=5,rate=5,dist="Gamma")))))
df <- rbind(df,data.frame(sample=rbinom(1,7,0.7),params=I(list(list(size=7,prob=0.7,dist="Binomial")))))
df <- rbind(df,data.frame(sample=rnorm(1,2,3),params=I(list(list(mean=2,sd=3,dist="Normal")))))
df <- rbind(df,data.frame(sample=rt(1,3),params=I(list(list(df=3,dist="Student-T")))))
The first column contains a random number of a probability distribution and the second column stores a list with its parameters and name.
The dataframe df looks like:
sample params
1 0.85102972 0, 1, Normal
2 0.67313218 5, 5, Gamma
3 3.00000000 7, 0.7, ....
4 0.08488487 2, 3, Normal
5 0.95025523 3, Student-T
Q1: How can I have the list of name distributions for all records? df$params$dist does not work. For a single record is easy, for example the third one: df$params[[3]]$dist
Q2: Is there any alternative way of storing data like this? something like a multi-dimensional dataframe? I do not want to add columns for each parameter because it will scatter the dataframe with missing values.

It's probably more natural to store information like this in a pure list structure, than in a data frame:
distList <- list(normal = list(sample=rnorm(1,0,1),params=list(mean=0,sd=1,dist="Normal")),
gamma = list(sample=rgamma(1,5,5),params=list(shape=5,rate=5,dist="Gamma")),
binom = list(sample=rbinom(1,7,0.7),params=list(size=7,prob=0.7,dist="Binomial")),
normal2 = list(sample=rnorm(1,2,3),params=list(mean=2,sd=3,dist="Normal")),
tdist = list(sample=rt(1,3),params=list(df=3,dist="Student-T")))
And then if you want to extract just the distribution name from each, we can use sapply to loop over the list and extract just that piece:
sapply(distList,function(x) x[[2]]$dist)
normal gamma binom normal2 tdist
"Normal" "Gamma" "Binomial" "Normal" "Student-T"

If you absolutely must store this information in a data frame, one way of doing so springs to mind. You're currently using a params column in your data frame to store the parameters associated with the distributions. Perhaps a better way of doing this would be to (i) identify the maximum number of parameters that you'll need for any distribution, (ii) store the distribution names in a field called df$distribution, and (iii) store the parameters in dedicated parameter columns, the meaning of which will have to be decided upon based on the type of distribution.
For instance, any row with df$distribution = 'Normal' should have df$param1 = and df$param2 = . A row with df$distribution='Student' should have df$param1 = and df$param2 = NA. Something like the following:
dg <- data.frame(sample=rnorm(1, 0, 1), distribution='Normal',
param1=0, param2=1)
dg <- rbind(dg, data.frame(sample=rgamma(1, 5, 5),
distribution='Gamma', param1=5, param2=5))
dg <- rbind(dg, data.frame(sample=rt(1, 3), distribution='Student',
param1=3, param2=NA))
It's ugly, but it will give you what you want. And don't worry about the missing values; missing values are a fact of life when dealing with non-trivial data frames. They can be dealt with easily in R by appropriate use of things like na.rm and complete.cases().

Based on the data frame you have above,
sapply(df$params,"[[","dist")
(or lapply if you prefer) would work.
I would probably put at least the names of the distributions in their own column, even if you want the parameters to be in variable-length lists.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

R gsub replace several texts at once - regex

Try: library(qdap) as.numeric(multigsub(c("[Tt]housands", "[Hh]undreds"), c(1000, 100), fixed = FALSE, a)) Or as per suggested by #RichardScriven: library(stringi) as.numeric(stri_replace_all_regex(a, c("[Tt]housands", "[Hh]undreds"), c(1000, 100), vectorize_all = FALSE))

Related

Replacing Value in Vector that are greater than x

Bin new data according to existing intervals given as factor levels

aregexec matching with two data frames

Identify the format of a character column in R

Column of lists inside a dataframe in R

Categories

Resources