Trouble recreating R code that uses case_when with the replace() function - replace

I am trying to familiarize myself with the different functions and R and ran into some difficulties recreating the following code using the replace() instead of case_when().
The packages I am using are:
library(tidyverse)
library(nycflights13)
The following is the code I am trying to recreate
diamonds %>%
select(carat)%>%
mutate(new_carat = case_when(carat <.28 ~ "too small",
carat>=.7 & carat<=1.04~"too big",
carat>4 ~ NA_character_,
TRUE~ as.character(carat)))
The following code is what I have tried
diamonds%>%
select(carat)%>%
mutate(carat_new = replace(carat,carat<=.28, "too small"),
carat_new = replace(carat, carat>=.7 & carat <=1.04, "too big"),
carat_new= replace(carat, carat>4, NA_character_),
carat_new= replace(carat, TRUE, as.character(carat)))
This didn't change anything except converting the data from dbl to chr

Related

Need to extract 4 spaces of text before the occurrence of a word that appears in a column in a df, and may occur several times per row

I need to extract text (4 characters) before the occurrence of the word "exception" per row in a column of my dataframe. For example, see two lines of my data below:
MPSA: Original Version (01/16/2015); FMV Exception: Original Version (04/11/2014); MM Exception: 08.19.15 (08/19/2015)
MPSA: Original Version (02/10/2015); FMV Exception: Original Version (12/18/2014); MEI FMV: V3 (12/18/2014); MEI FMV: updated (11/18/2014); Meeting Material exception: Original Version (04/21/2014);
As you can see, "exception" occurrs more than one time per line, is sometimes capitalized and sometimes not, and has different text before. I need to extract the "FMV", "MM", and "ial" that come before in each case. The goal is to extract as a version of the following (comma separating would be fine but not needed):
"FMVMM"
"FMVial"
I am planning on making all text lower case for simplicity, but I cannot find a regex to extract the 4 characters of text I need after that. Any recommendations?
You basically need strsplit, substr and nchar:
t1 <- "1.MPSA: Original Version (01/16/2015); FMV Exception: Original Version (04/11/2014); MM Exception: 08.19.15 (08/19/2015)"
t2 <- "2.MPSA: Original Version (02/10/2015); FMV Exception: Original Version (12/18/2014); MEI FMV: V3 (12/18/2014); MEI FMV: updated (11/18/2014); Meeting Material exception: Original Version (04/21/2014); "
f <- function(x){
tmp <- strsplit(x, "[Ee]xception")[[1]]
ret <- array(dim = length(tmp) - 1)
for(i in 1:length(ret)){
ret[i] <- substr(tmp[i], start = nchar(tmp[i]) - 3, stop = nchar(tmp[i]))
}
return(paste(ret, collapse = ","))
}
f(t1) #gives "FMV , MM "
f(t2) #gives "FMV ,ial "
Avoiding the loop would be better but for now, this should work.
Edit by Qaswed: Improved the function (shorter and does not need tolower any more).
Edit by TigeronFire:
#Qaswed, thank you for your guidance - the answer, however, poses another problem. t1 and t2 are only two lines on a dataframe 10000 rows long. I attempted to add the column logic to the function you built a few different ways, but I always received the error message:
"Error in strsplit(BOSSMWF_practice$Documents, "[Ee]xception") : non-character argument"
I tried the following with reference to dataframe column BOSSMWF_practice$Documents:
f <- function(x){
tmp <- strsplit(BOSSMWF_practice$Documents, "[Ee]xception")[[1]]
ret <- array(dim = length(tmp) - 1)
for(i in 1:length(ret)){
ret[i] <- substr(tmp[i], start = nchar(tmp[i]) - 3, stop = nchar(tmp[i]))
}
return(paste(ret, collapse = ","))
}
AND:
f <- function(x){
BOSSMWF_practice$tmp <- strsplit(BOSSMWF_practice$Documents, "[Ee]xception")[[1]]
BOSSMWF_practice$ret <- array(dim = length(BOSSMWF_practice$tmp) - 1)
for(i in 1:length(BOSSMWF_practice$ret)){
BOSSMWF_practice$ret[i] <- substr(BOSSMWF_practice$tmp[i], start = nchar(BOSSMWF_practice$tmp[i]) - 3, stop = nchar(BOSSMWF_practice$tmp[i]))
}
return(paste(ret, collapse = ","))
}
I attempted to run the function on my applicable column using both function setups
BOSSMWF_practice$Funct <- f(BOSSMWF_practice$Documents)
But I always received the above error message. Can you take your advice one step further and indicate how to apply this to a dataframe and place the results in a new column?
Edit by Qaswed:
#TigeronFire you should have added a comment to my answer or editing your question, but not editing my question. To your comment:
#if your dataset looks something like this:
df <- data.frame(variable_name = c(t1, t2))
#...use
apply(df, 1, FUN = f)
#note: there was an error in f. You need strsplit(x, ...) and not strsplit(t1, ...).

Why RcppArmadillo's fastLmPure produces NA's in output but fastLm doesn't?

I use rolling regression in R quite a lot and my initial setup is something like:
dolm <- function(x) coef(lm(x[,1] ~ x[,2] + 0, data = as.data.frame(x)))
rollingCoef = rollapply(someData, 100, dolm)
Above example works perfectly, except it is slow if you have a lot of iterations.
To speed it up I've decided to experiment with Rcpp package.
First I substituted lm with fastLm, result is a bit faster but still slow. So that pushed me to attempt to write the entire rolling regression's coefficients function in c++ as for loop and than integrate it in R with Rcpp help.
So I've changed original RcppArmadillo's function fastLm to this:
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;
// [[Rcpp::export]]
List rollCoef(const arma::mat& X, const arma::colvec& y, double window ) {
double cppWindow = window - 1;
double matRows = X.n_rows;
double matCols = X.n_cols - 1;
arma::mat coef( matRows - cppWindow, X.n_cols); // matrix for estimated coefficients
//for loop for rolling regression.
for( double i = 0 ; i < matRows - cppWindow ; i++ )
{
coef.row(i) = arma::trans(arma::solve(X( arma::span(i,i + cppWindow), arma::span(0,matCols)) , y.rows(i,i + cppWindow)));
}
return List::create(_["coefficients"] = coef);
}
and than download it to R with sourceCpp(file=".../rollCoef.cpp")
So it's much faster than rollapply and it worked fine on small examples, but than I applied it to ~200000 observations of data it produced ~half of NA's in output, in the same time rollapply/fastLm combination didn't produce any.
So here I need some help. What is wrong with my function? Why are there NA's in my function output, and no NA's in rollapply/fastLm, however, if I understand right, them both based on arma::solve? Any help is highly appreciated.
UPDATE
Here is reproducible code:
require(Rcpp)
require(RcppArmadillo)
require(zoo)
require(repmis)
myData <- source_DropboxData(file = "example.csv",
key = "cbrmkkbssu5bn96", sep = ",", header = TRUE)
## in order to use my custom function "rollCoef" you should download it to R.
## The c++ code is presented above in the main question.
## Download it where you want as "rollCoef.cpp" and then download it to R with:
sourceCpp(file=".../rollCoeff.cpp"). # there should be your actual path.
myCoef = rollCoef(as.matrix(myData[,2]),myData[,1],260)
summary(unlist(myCoef)) # 80923 NA's
dolm = function(x) coef(fastLmPure(as.matrix(x[,2]), x[,1]))
myCoef2 = rollapply(myData, 260, dolm, by.column = FALSE)
summary(myCoef2) # 80923 NA's
dolm2 = function(x) coef(fastLm(x[,1] ~ x[,2] + 0, data = as.data.frame(x)))
myCoef3 = rollapply(myData, 260, dolm2, by.column = FALSE)
summary(myCoef3) # !!! No NA's !!!
head(unlist(myCoef)) ; head(unlist(myCoef2)) ; head(myCoef3)
So the output of my function is identical to output of RcppArmadillo's fastLmPure combined with rollapply and them both produce NA's, but rollapply with fastLm does not. As I understand, for example from HERE and HERE fastLm is basically calling to fastLmPure, but why is there no NA's in the third method than? Is there some additional capabilities in fastLm that prevent NA's that I didn't spotted?
There is an entire package RcppRoll to do just that custom rolling -- and you should be able to extend it and its rollit() function to do rolling lm() as well.

sapply function in R is not giving me the desired result

I'm trying to use sapply instead of a 'for' loop but I'm not getting the result I'm expecting, I've tested each line separated and the code is working but when I use sapply is not. I'm looking for some hints on what might be wrong:
event <- c('Astronomical Low Tide', 'Avalanche', 'Blizzard', 'Coastal Flood',
'Cold/Wind Chill', 'Debris Flow', 'Dense Fog', 'Dense Smoke', 'Drought',
'Dust Devil', 'Dust Storm','Excessive Heat', 'Extreme Cold/Wind Chill',
'Flash Flood', 'Flood', 'Frost/Freeze', 'Funnel Cloud', 'Freezing Fog',
'Hail', 'Heat', 'Heavy Rain', 'Heavy Snow', 'High Surf', 'High Wind',
'Hurricane/Typhoon', 'Ice Storm', 'Lake/Effect Snow', 'Lakeshore Flood',
'Lightning', 'Marine Hail', 'Marine High Wind', 'Marine Strong Wind',
'Marine Thunderstorm Wind', 'Rip Current', 'Seiche', 'Sleet',
'Storm Surge/Tide', 'Strong Wind', 'Thunderstorm Wind', 'Tornado',
'Tropical Depression', 'Tropical Storm', 'Tsunami', 'Volcanic Ash',
'Waterspout', 'Wildfire', 'Winter Storm', 'Winter Weather')
replace <- function(dt, x, col) {
idx <- grep(paste('(?i)', event[x], sep = ''), dt[, col])
dt[idx, col] <- event[x]
}
sapply(1:length(event), function(x) replace(stormdata, x, 8))
Basically, what I'm trying to do is to use every value on the event variable as a pattern on the grep function within the custom made replace function then I get the index of the rows that matched my pattern and stored them in the idx variable. After that I want to replace the rows in the data frame that correspond to the idx values with the value contained in the event variable.
I'm trying to create a loop with the sapply function to use every value on the event variable, so I want a loop that goes 48 times looking for each pattern in the data frame stormdata on its 8th column and replace them. BUT my code does nothing, after running it the data remains the same, no substitutions. When I run each line separately without the sapply it works.
I've looking everywhere, I can't find why isn't working. Help.
Try using global assignment eg stormdata[idx, col] <<- event[x] in your function. Not clean but probably will work.

Calling a web service that depends on a result from another web service in play framework

I want to call a second web service with the result from the first one.
Below is some code that highlights my intent.
By the way it compiles fine in IntelliJ(Probable a bug in the IDE).
def get = {
for {
respA <- WS.url(url1).get
id <- respA.body.split(",").take(2)
respB <- WS.url(url2 + id).get // Here is a compile error
} yield {
getMyObjects(respB.xml)
}
}
respA = is a comma separated list with ids used in the next call.
respB = is an XML response that I parse in the yield method
The compile error Play Framework gives me:
type mismatch;
found : scala.concurrent.Future[Seq[MyObject]]
required: scala.collection.GenTraversableOnce[?]
I find the compile error strange.
How can a Future of [Seq[MyObject]] exist at that line?
It shouldn't be any different from the line two lines up that compiles?
WS.url(url1).get returns Future[Response] so all your generators in the for comprehension should be futures. In your code snippet, you are mixing Array[String] and Future[Response]. See Type Mismatch on Scala For Comprehension for some background on for comprehension.
So I would try something like this:
for {
respA <- WS.url(url1).get
ids = respA.body.split(",").take(2)
responsesB <- Future.sequence(ids.map(id => WS.url(url2 + id).get))
} yield {
responsesB.map(respB => getMyObjects(respB.xml))
}
So the types are:
respA: Response
ids: Array[String]
ids.map(id => WS.url(url2 + id).get): Array[Future[Response]]
Future.sequence(ids.map(id => WS.url(url2 + id).get)): Future[Array[Response]]
responsesB: Array[Response]
And the return type of the for comprehension is a Future of an array of whatever getMyObjects returns.
Note that if sequence does not work on Future[Array[_]] try to do ids = respA.body.split(",").toList.take(2).

Why does my unit test run successfully in the R console but returns an error with "make test"?

I am learning how to develop an R package. Everything goes well, thanks to the R manuals and this wiki for RUnit. More precisely, when I launch my unit tests within a new R console, all tests finish successfully:
#rm(list=ls())
library(RUnit)
testSuite <- defineTestSuite("current", "~/src/mypkg/inst/unitTests/")
isValidTestSuite # returns TRUE
runTestSuite(testSuite) # returns Number of errors: 0 and Number of failures: 0
However, when I launch them in a terminal, I got one error (the function in question uses the package GenomicRanges that I installed in "~/src/Rlibs"):
$ make test R_LIBS="~/src/Rlibs/"
...
ERROR in test.MyFunction: Error in match(x, table, nomatch = 0L) :
'match' requires vector arguments
I don't see what is causing this error. I guess you will need more info about the code and the test, but it's not easy because I don't know how to replicate this error on a small example without making a new package just for this. Maybe some of you will have an idea about this error message and give me some hints?
Edit: to help someone to give me a hint on the error, here is the code I wrote for a dummy package. The aim is to find which items of "p" are included within items of "g".
Here is the test:
test.MyFunction <- function(){
g <- list(c1=data.frame(name=c("g1","g2"), start=c(11,1111),
end=c(500,1500), strand=c("+","+"), stringsAsFactors=FALSE))
p <- list(c1=data.frame(name=c("p1","p2"), strand=c("+","-"),
start=c(11,601), end=c(20, 610), stringsAsFactors=FALSE))
exp <- list(c1=list(g1=c("p1"))) # item "p1" is included in item "g1"
obs <- MyFunction(g, p)
checkEquals(obs, exp)
}
And here is the function itself:
MyFunction <- function(g, p){
res <- lapply(names(g), function(c.name){
res.c <- list()
nb.g <- length(g[[c.name]]$name)
if(length(.find.package("GenomicRanges", quiet=TRUE)) > 0){
g.ranges <- GRanges(seqnames=Rle(c(c.name), c(nb.g)),
ranges=IRanges(g[[c.name]]$start,
g[[c.name]]$end, names=g[[c.name]]$name),
strand="*")
p.ranges <- GRanges(seqnames=Rle(c(c.name), nrow(p[[c.name]])),
ranges=IRanges(p[[c.name]]$start,
p[[c.name]]$end, names=p[[c.name]]$name),
strand=p[[c.name]]$strand)
for(g.name in names(g.ranges)){
links <- p.ranges %in% g.ranges[names(g.ranges) == g.name]
if(sum(links) > 0)
res.c[[g.name]] <- names(p.ranges)[which(links)]
}
} else{
msg <- "can't find package GenomicRanges"
stop(msg, call.=FALSE)
}
res.c
})
names(res) <- names(g)
return(res)
}
I think this line is your culprit:
links <- p.ranges %in% g.ranges[names(g.ranges) == g.name].
%in% is match, and that is what the error message seems to be reading:
ERROR in test.MyFunction: Error in match(x, table, nomatch = 0L) :
'match' requires vector arguments
There is something about p.ranges and g.ranges that it doesn't like. I.e., they can't be coerced to vectors OR you're not subsetting properly and the object type is incorrect ([ or [[).