read.dta convert.dates not working? - regex

I have a Stata dataset, call it dataset.dta, that I want to read into R using the foreign package. The problem is that it fails to convert the Stata dates to R dates.
It goes something like this:
df <- read.dta( 'dataset.dta', convert.dates = TRUE )
# Check attributes
attr( df, "formats")
"%9s" "%8.0g" "%12.0g" "%12.0g" "%9.0g" "%21s" "%31s" "%td" "%td"
# Last two columns are dates i.e. %td
str( df )
... # Only showing last two columns
$ start_sample: num 15494 14246 14246 14670 14245 ...
$ end_sample : num 18262 18262 18262 18262 18262 ...
I was expecting Date class for these, instead of num. When I look into the source code of read.dta I find this.
if (convert.dates) {
    ff <- attr(rval, "formats")
    dates <- grep("%-*d", ff)
    base <- structure(-3653, class = "Date")
    for (v in dates) rval[[v]] <- base + rval[[v]]
}
Changing the third line here to dates <- grep("%*d", ff), i.e. loosening the regex, seems to take care of the issue. I'm using Stata version 13.0.
Am I missing something? Is this just a bug, or am I doing something woefully wrong here?

Two quick fixes/hacks. The first is
#### Convert to dates ####
datelookup <- format(seq(as.Date("1960-01-01"), as.Date("2015-12-31"), by = "1 day"))
df$start_sample_dates <- datelookup[df$start_sample + 1]
df$end_sample_dates <- datelookup[df$end_sample + 1]
Stata uses 01/01/1960 as the base. The second is
#### Stealing from foreign package ####
ff <- attr(df, "formats")
dates <- grep("%*d", ff)
base <- structure(-3653, class = "Date")
for (v in dates) df[[v]] <- base + df[[v]]
Why structure(-3653, class = "Date")? R counts days from 1970-01-01 while Stata counts from 1960-01-01, which is 3653 days earlier; see the comment by @Dimitriy V. Masterov above. This issue could be specific to Stata version 13.0; see the comment by @dickoa above. Thanks for your help.
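As an aside, base R can do the same conversion directly, because as.Date() accepts a numeric vector of day counts together with an origin; a minimal sketch, assuming the two date columns are named as above:
# convert Stata day counts (days since 1960-01-01) straight to Date
df$start_sample <- as.Date(df$start_sample, origin = "1960-01-01")
df$end_sample <- as.Date(df$end_sample, origin = "1960-01-01")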

How to knit out table codes into table in R markdown

I am a basic-level learner of R. I am having trouble knitting out tables with code my professor designed for the students. The table-formatting code is set up as below, and I put it in my R Markdown like this:
```{r, results="hide", message=FALSE, warning = FALSE, error = FALSE}
## my style latex summary of regression
jhp_report <- function(...){
  output <- capture.output(stargazer(..., omit.stat=c("f", "ser")))
  # The first three lines are the ones we want to remove...
  output <- output[4:length(output)]
  # cat out the results - this is essentially just what stargazer does too
  cat(paste(output, collapse = "\n"), "\n")
}
```
After this, I tried printing this out with knitr.
```{r, message=FALSE, warning = FALSE, error = FALSE}
set.seed(1973)
N <- 100
x <- runif(N, 6, 20)
D <- rbinom(N, 1, .5)
y <- 1 + 0.5*x - .4*D + rnorm(N)
df.lm <- data.frame(y = y, x = x, D = D)
df.lm$D <- factor(df.lm$D, labels = c('Male', 'Female'))
##REGRESSION
reg.parallel <- lm(y ~ x + D, data = df.lm)
jhp_report(reg.parallel, title = "Result", label = "tab:D", dep.var.labels = "$y$")
```
As a result, instead of a table, it keeps showing only the raw code. I would like to know how to set up the R Markdown so that it prints the table instead of the code. This is what the result looks like when I knit it.
I expected that there must be some setup option to print the table out, but I couldn't find the right one. Also, my assignment for class requires students to use this code. I did find other options like knitr::kable, but I would like to use the given code for this assignment.
Thank you in advance!
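For what it's worth, stargazer prints raw LaTeX (or text) that knitr only typesets as a table when the chunk output is passed through untouched, which is what the chunk option results="asis" does; the default results="markup" shows it verbatim, and results="hide" suppresses it. A minimal sketch of the calling chunk, assuming a PDF/LaTeX output format (for HTML output the stargazer type argument would also need changing):
```{r, results="asis", message=FALSE, warning=FALSE}
jhp_report(reg.parallel, title = "Result", label = "tab:D", dep.var.labels = "$y$")
```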

Need to extract 4 spaces of text before the occurrence of a word that appears in a column in a df, and may occur several times per row

I need to extract text (4 characters) before the occurrence of the word "exception" per row in a column of my dataframe. For example, see two lines of my data below:
MPSA: Original Version (01/16/2015); FMV Exception: Original Version (04/11/2014); MM Exception: 08.19.15 (08/19/2015)
MPSA: Original Version (02/10/2015); FMV Exception: Original Version (12/18/2014); MEI FMV: V3 (12/18/2014); MEI FMV: updated (11/18/2014); Meeting Material exception: Original Version (04/21/2014);
As you can see, "exception" occurs more than once per line, is sometimes capitalized and sometimes not, and has different text before it. I need to extract the "FMV", "MM", and "ial" that come before it in each case. The goal is to end up with something like the following (comma separation would be fine but is not needed):
"FMVMM"
"FMVial"
I am planning on making all text lower case for simplicity, but I cannot find a regex to extract the 4 characters of text I need after that. Any recommendations?
You basically need strsplit, substr and nchar:
t1 <- "1.MPSA: Original Version (01/16/2015); FMV Exception: Original Version (04/11/2014); MM Exception: 08.19.15 (08/19/2015)"
t2 <- "2.MPSA: Original Version (02/10/2015); FMV Exception: Original Version (12/18/2014); MEI FMV: V3 (12/18/2014); MEI FMV: updated (11/18/2014); Meeting Material exception: Original Version (04/21/2014); "
f <- function(x){
  tmp <- strsplit(x, "[Ee]xception")[[1]]
  ret <- array(dim = length(tmp) - 1)
  for(i in 1:length(ret)){
    ret[i] <- substr(tmp[i], start = nchar(tmp[i]) - 3, stop = nchar(tmp[i]))
  }
  return(paste(ret, collapse = ","))
}
f(t1) #gives "FMV , MM "
f(t2) #gives "FMV ,ial "
Avoiding the loop would be better but for now, this should work.
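If you do want to avoid the loop, a lookahead pattern with gregexpr()/regmatches() pulls out the same four characters in one vectorised pass; a small sketch (the helper name g is just for illustration):
# grab the 4 characters immediately before "exception"/"Exception"
g <- function(x){
  m <- regmatches(x, gregexpr(".{4}(?=[Ee]xception)", x, perl = TRUE))
  sapply(m, paste, collapse = ",")
}
g(c(t1, t2)) # "FMV , MM "  "FMV ,ial "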
Edit by Qaswed: Improved the function (shorter and does not need tolower any more).
Edit by TigeronFire:
@Qaswed, thank you for your guidance - the answer, however, poses another problem. t1 and t2 are only two lines of a dataframe 10000 rows long. I attempted to add the column logic to the function you built a few different ways, but I always received the error message:
"Error in strsplit(BOSSMWF_practice$Documents, "[Ee]xception") : non-character argument"
I tried the following with reference to dataframe column BOSSMWF_practice$Documents:
f <- function(x){
  tmp <- strsplit(BOSSMWF_practice$Documents, "[Ee]xception")[[1]]
  ret <- array(dim = length(tmp) - 1)
  for(i in 1:length(ret)){
    ret[i] <- substr(tmp[i], start = nchar(tmp[i]) - 3, stop = nchar(tmp[i]))
  }
  return(paste(ret, collapse = ","))
}
AND:
f <- function(x){
  BOSSMWF_practice$tmp <- strsplit(BOSSMWF_practice$Documents, "[Ee]xception")[[1]]
  BOSSMWF_practice$ret <- array(dim = length(BOSSMWF_practice$tmp) - 1)
  for(i in 1:length(BOSSMWF_practice$ret)){
    BOSSMWF_practice$ret[i] <- substr(BOSSMWF_practice$tmp[i], start = nchar(BOSSMWF_practice$tmp[i]) - 3, stop = nchar(BOSSMWF_practice$tmp[i]))
  }
  return(paste(ret, collapse = ","))
}
I attempted to run the function on my applicable column using both function setups
BOSSMWF_practice$Funct <- f(BOSSMWF_practice$Documents)
But I always received the above error message. Can you take your advice one step further and indicate how to apply this to a dataframe and place the results in a new column?
Edit by Qaswed:
@TigeronFire, you should have added a comment to my answer or edited your own question, rather than editing mine. To your comment:
# if your dataset looks something like this:
df <- data.frame(variable_name = c(t1, t2))
# ... use
apply(df, 1, FUN = f)
# note: there was an error in f. You need strsplit(x, ...) and not strsplit(t1, ...).
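For completeness, the "non-character argument" error usually means the Documents column is a factor rather than character (the default for data.frame() and read.csv() before R 4.0), so converting it first and applying f element-wise also works; a sketch using the column names from the question:
# BOSSMWF_practice$Documents is assumed to be the text column from the question
docs <- as.character(BOSSMWF_practice$Documents)
BOSSMWF_practice$Funct <- vapply(docs, f, character(1), USE.NAMES = FALSE)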

R: Regress all variables that match certain pattern

Is there a way in R to add all variables into a regression that match a certain pattern? For example, I have a bunch of variables in my dataset that correspond to holidays with the prefix h_, and I have other variables with other prefixes such as a_.
Is there a way to do something like this:
lm(homicide ~ w_* + a_*, data= df)
To programmatically construct a formula, have a look at reformulate().
Here's an example that uses grep() to find all variables that begin with a "d" and then uses reformulate() to plug them in as the regressor variables on the RHS of a formula object.
vv <- grep("^d.*", names(mtcars), value=TRUE)
ff <- reformulate(termlabels=vv, response="mpg")
lm(ff, data=mtcars)
#
# Call:
# lm(formula = ff, data = mtcars)
#
# Coefficients:
# (Intercept)         disp         drat
#    21.84488     -0.03569      1.80203
A string can be turned into a formula.
data(iris)
fmla <- as.formula(paste("Species ~",
                         paste(grep("Width", names(iris), value = TRUE), collapse = " + ")))
glm(fmla, data = iris, family = binomial(link = "logit"))
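Applied to the question's setup (the names homicide, h_, and a_ below are just the placeholders from the question), the same idea looks like this:
vars <- grep("^(h_|a_)", names(df), value = TRUE) # all holiday and a_ columns
lm(reformulate(vars, response = "homicide"), data = df)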

How to separate the variables of a particular column in a CSV file and write to a CSV file in R?

I have a CSV file like
Market,CampaignName,Identity
Wells Fargo,Gary IN MetroChicago IL Metro,56
EMC,Los Angeles CA MetroBoston MA Metro,78
Apple,Cupertino CA Metro,68
Desired Output to a CSV file with the first row as the headers
Market,City,State,Identity
Wells Fargo,Gary,IN,56
Wells Fargo,Chicago,IL,56
EMC,Los Angeles,CA,78
EMC,Boston,MA,78
Apple,Cupertino,CA,68
res <- gsub('(.*) ([A-Z]{2})*Metro (.*) ([A-Z]{2}) .*',
            '\\1,\\2:\\3,\\4',
            xx$Market)
How can I modify the above regular expression to get this result in R?
I'm new to R; any help is appreciated.
library(stringr)
xx.to.split <- with(xx, setNames(gsub("Metro", "", as.character(CampaignName)), Market))
do.call(rbind, str_match_all(xx.to.split, "(.+?) ([A-Z]{2}) ?"))[, -1]
Produces:
            [,1]          [,2]
Wells Fargo "Gary"        "IN"
Wells Fargo "Chicago"     "IL"
EMC         "Los Angeles" "CA"
EMC         "Boston"      "MA"
Apple       "Cupertino"   "CA"
This should work even if you have a different number of campaign names in each market. Unfortunately I think base options are annoying to implement because, frustratingly, there isn't a gregexec, although I'd be curious if someone comes up with something comparably compact in base.
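For what it's worth, R 4.1.0 did add gregexec(), so a base-only version along the same lines is now possible; a sketch using the same xx columns:
m <- regmatches(xx$CampaignName, gregexec("(.+?) ([A-Z]{2}) Metro", xx$CampaignName))
# each element of m is a matrix: row 1 = full match, rows 2-3 = the capture groups
do.call(rbind, Map(function(mkt, id, z)
  data.frame(Market = mkt, City = z[2, ], State = z[3, ], Identity = id),
  xx$Market, xx$Identity, m))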
Here is a solution using base R. Split the CampaignName column on the string " Metro", adding sequential numbers as names. stack() turns the result into a data frame with columns ind and values, which we massage into DF1. Merge that with xx, matching the sequence numbers in DF1 to the row numbers of xx. Move Market to the front of DF2 and drop ind and CampaignName. Finally, write it out.
xx <- read.csv("Campaign.csv", as.is = TRUE)
s <- strsplit(xx$CampaignName, " Metro")
names(s) <- seq_along(s)
ss <- stack(s)
DF1 <- with(ss, data.frame(ind,
    City = sub(" ..$", "", values),
    State = sub(".* ", "", values)))
DF2 <- merge(DF1, xx, by.x = "ind", by.y = 0)
DF <- DF2[ c("Market", setdiff(names(DF2), c("ind", "Market", "CampaignName"))) ]
write.csv(DF, file = "myfile.csv", row.names = FALSE, quote = FALSE)
REVISED to handle extra columns after poster modified the question to include such. Minor improvements.

How to add column to data.table with values from list based on regex

I have the following data.table:
       id fShort
1  432-12   1245
2 3242-12 453543
3  324-32  45543
4  322-34  45343
5 2324-34  13543
DT <- data.table(
  id = c("432-12", "3242-12", "324-32", "322-34", "2324-34"),
  fShort = c("1245", "453543", "45543", "45343", "13543"))
and the following list:
filenames <- list("3242-124342345.png", "432-124343.png", "135-13434.jpeg")
I would like to create a new column "fComplete" that holds the complete filename from the list. For this, the values of the "id" column need to be matched against the filename list: if a filename starts with the "id" string, the complete filename should be returned. I use the following call,
t <- grep("432-12", "432-124343.png", value = TRUE)
which returns the correct filename.
This is what the final table should look like:
       id fShort          fComplete
1  432-12   1245     432-124343.png
2 3242-12 453543 3242-124342345.png
3  324-32  45543                 NA
4  322-34  45343                 NA
5 2324-34  13543                 NA
DT2 <- data.table(
  id = c("432-12", "3242-12", "324-32", "322-34", "2324-34"),
  fShort = c("1245", "453543", "45543", "45343", "13543"),
  fComplete = c("432-124343.png", "3242-124342345.png", NA, NA, NA))
I tried using apply and data.table approaches but I always get warnings like
argument 'pattern' has length > 1 and only the first element will be used
What is a simple approach to accomplish this?
Here's a data.table solution:
DT[ , fComplete := lapply(id, function(x) {
  m <- grep(x, filenames, value = TRUE)
  if (!length(m)) NA else m})]
        id fShort          fComplete
1:  432-12   1245     432-124343.png
2: 3242-12 453543 3242-124342345.png
3:  324-32  45543                 NA
4:  322-34  45343                 NA
5: 2324-34  13543                 NA
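If you would rather have an atomic character column than a list column, and want to match the "filename starts with the id" requirement literally, vapply() with an anchored pattern is a small variant; a sketch:
DT[, fComplete := vapply(id, function(x) {
  m <- grep(paste0("^", x), filenames, value = TRUE) # ^ anchors the match to the start
  if (length(m)) m[1] else NA_character_
}, character(1))]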
In my experience with similar functions, the regex functions sometimes return a list, so you have to account for that in the apply - I usually work through an example manually.
Also, in my experience apply on its own will not always return something that drops straight into a data.frame; sometimes I had to use lapply, and/or unlist and data.frame, to massage it.
Here is an answer - I am not familiar with data.tables, and I was having issues with the filenames being in a list, but with some transformations this works. I worked it out by seeing what apply was outputting and adding the [1] to get the piece I needed.
DT <- data.frame(
  id = c("432-12", "3242-12", "324-32", "322-34", "2324-34"),
  fShort = c("1245", "453543", "45543", "45343", "13543"))
filenames <- list("3242-124342345.png", "432-124343.png", "135-13434.jpeg")
filenames1 <- unlist(filenames)
x <- apply(DT[1], 1, function(x) grep(x, filenames1)[1])
DT$fComplete <- filenames1[x]