I wrote an R script to randomly assign participants for an RCT. I used set.seed() to ensure I would have reproducible results.
I now want to document what I have done in an R Markdown document, but confusingly I don't get the same results, despite using the same seed.
Here is the code chunk:
knitr::opts_chunk$set(cache = T)
set.seed(4321)
Group <- sample(1:3, 5, replace=TRUE)
couple.df <- data.frame(couple.id=1:5,
                        partner1=paste0("FRS0", c(35, 36, 41, 50, 61)),
                        partner2=paste0("FRS0", c(38, 37, 42, 51, 62)),
                        Group)
print(couple.df)
And here is the output I get when running it as a chunk:
  couple.id partner1 partner2 Group
1         1   FRS035   FRS038     2
2         2   FRS036   FRS037     3
3         3   FRS041   FRS042     2
4         4   FRS050   FRS051     1
5         5   FRS061   FRS062     3
(not sure how to get this to format)
This is the same as I had when I wrote the original code as an R script.
However, when I knit the markdown file I get the following output in my HTML document (sorry again about the formatting; I have just copied and pasted from the HTML document, adding in the ticks to format it as code, and pointers on how to do this properly would also be welcome):
knitr::opts_chunk$set(cache = T)
set.seed(4321)
Group <- sample(1:3, 5, replace=TRUE)
couple.df <- data.frame(couple.id=1:5,
                        partner1=paste0("FRS0", c(35, 36, 41, 50, 61)),
                        partner2=paste0("FRS0", c(38, 37, 42, 51, 62)),
                        Group)
print(couple.df)
## couple.id partner1 partner2 Group
## 1 1 FRS035 FRS038 1
## 2 2 FRS036 FRS037 2
## 3 3 FRS041 FRS042 3
## 4 4 FRS050 FRS051 2
## 5 5 FRS061 FRS062 1
That is, they are different. What is going on here, and how can I get the markdown document to give the same results? I am committed to using the allocation I arrived at with the original script.
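One thing worth ruling out (an assumption, since the question doesn't say whether the script and the knit ran under the same R version) is that R 3.6.0 changed the default algorithm behind sample(). If the original script ran under an older R, the old behaviour can be requested explicitly; and since the chunk sets cache = TRUE, it is also worth deleting the knitr cache folder so the chunk actually re-runs. A minimal sketch:
# Assumption: the original allocation was generated under R < 3.6.0,
# where sample() used the old "Rounding" algorithm.
set.seed(4321, sample.kind = "Rounding")  # warns that the old sampler is slightly non-uniform
Group <- sample(1:3, 5, replace = TRUE)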
I've got a time series of temperature data with some wrong values that I want to sort out. The problem is that I only want to remove the points within a certain period of time.
If I sort out the wrong points by their temperature value, ALL of the points with this temperature value are removed (across the whole measuring period).
This is a very simplified version of my code (in reality, there are many more values):
laketemperature <- c(15, 14, 14, 12, 11, 9, 9, 8, 6, 4, 15, 14, 3) #only want to sort out the last 14 and 15
out <- c(15, 14)
laketemperature_clean <- laketemperature[!laketemperature %in% out] # the 15s and 14s at the beginning are removed, too :(
I want to have the whole laketemperature series in the end, only without the second 15.
I already tried ifelse, but it didn't work out.
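A minimal sketch of the kind of thing I am after (the index window 11:13 is an assumption standing in for the real period): selecting by position instead of by value leaves the earlier 15s and 14s untouched.
period <- 11:13                                  # assumed positions of the suspect period
bad <- period[laketemperature[period] %in% out]  # only the 15/14 points inside that window
laketemperature_clean <- laketemperature[-bad]
laketemperature_clean
# [1] 15 14 14 12 11  9  9  8  6  4  3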
I want to perform a regexp_replace operation on a PySpark dataframe column using a dictionary.
Dictionary: {'RD': 'ROAD', 'DR': 'DRIVE', 'AVE': 'AVENUE', ...}
The dictionary will have around 270 key-value pairs.
Input Dataframe:
ID | Address
1 | 22, COLLINS RD
2 | 11, HEMINGWAY DR
3 | AVIATOR BUILDING
4 | 33, PARK AVE MULLOHAND DR
Desired Output Dataframe:
ID | Address | Address_Clean
1 | 22, COLLINS RD | 22, COLLINS ROAD
2 | 11, HEMINGWAY DR | 11, HEMINGWAY DRIVE
3 | AVIATOR BUILDING | AVIATOR BUILDING
4 | 33, PARK AVE MULLOHAND DR | 33, PARK AVENUE MULLOHAND DRIVE
I cannot find any documentation on the internet, and if I try to pass the dictionary as in the code below
data=data.withColumn('Address_Clean',regexp_replace('Address',dict))
it throws the error "regexp_replace takes 3 arguments, 2 given".
The dataset will be around 20 million rows. Hence, a UDF solution will be slow (because it operates row-wise), and we don't have access to Spark 2.3.0, which supports pandas_udf.
Is there any efficient method of doing this, other than maybe using a loop?
It is throwing you this error because regexp_replace() needs three arguments:
regexp_replace('column_to_change','pattern_to_be_changed','new_pattern')
But you are right, you don't need a UDF or a loop here. You just need some more regexp and a directory table that looks exactly like your original dictionary :)
Here is my solution for this:
import pyspark.sql.functions as sf

# First, get rid of all the endings you want to replace.
# You can use the OR (|) operator in the regex for that.
# You could probably automate building that pattern from the dictionary keys,
# but I will leave that for you to decide.
input_df = input_df.withColumn('start_address', sf.regexp_replace("original_address", "RD|DR|etc...", ""))

# You will still need the old ending in a separate column;
# this way you have something to join on your directory table.
input_df = input_df.withColumn('end_of_address', sf.regexp_extract('original_address', "(.*) (.*)", 2))

# Now join the directory table, which has two columns:
# the endings you want to replace and the endings you want to have instead.
input_df = directory_df.join(input_df, 'end_of_address')

# Finally, concatenate the trimmed address with the correct ending.
input_df = input_df.withColumn('address_clean', sf.concat('start_address', 'correct_end'))
I want to have some conditional words in my R Markdown document. Depending on the outcome of some of the calculations from the tables, different words should show up in the ordinary text. Please see my example below:
The table (a chunk):
testtabell <- matrix(c(32, 33, 45, 67, 21, 56, 76, 33, 22), ncol=3,byrow = TRUE)
colnames(testtabell) <- c("1990", "1991", "1992")
rownames(testtabell) <- c("Region1", "Region2", "Region3")
testtabell <- as.table(testtabell)
testtabell
This should be in the inline code and generate different word options in the regular text flow of the .Rmd:
`r if testtabell[2,2]-[2,1] < testtabell[3,2]-testtabell[3,1] then type "under" or else "above"`
Although the question is ancient, maybe someone can use this solution.
You can use the following inline code:
`r ifelse(testtabell[2,2]-testtabell[2,1] < testtabell[3,2]-testtabell[3,1],"under","above")`
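For illustration, a minimal sketch of how this inline code might sit in the running text of the .Rmd (the surrounding sentence is invented):
The change in Region2 was `r ifelse(testtabell[2,2]-testtabell[2,1] < testtabell[3,2]-testtabell[3,1],"under","above")` the change in Region3.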
I have a very large data frame:
'data.frame': 40525992 obs. of 14 variables:
$ INSTNM : Factor w/ 7050 levels "A W Healthcare Educators"
$ Total : Factor w/ 3212 levels "1","10","100",
$ Crime_Type : Factor w/ 72 levels "MURD11","NEG_M11",
$ Count : num 0 0 0 0 0 0 0 0 0 0 ...
The Crime_Type column contains the type of crime and the year, so "MURD11" is murder in 2011. These are college campus crime statistics my kid is analyzing for her school project; I am helping when she is stuck. I am currently stuck at creating a clean data file she can analyze.
Once I converted the wide file (all nine crime types in columns) to a long file using gather, the file size went from 300 MB to 8 GB. The file I am working on is 8 GB; do you think that is the problem? How do I convert it to a data.table for faster processing?
What I want to do is split this 'Crime_Type' column into two columns, 'Crime_Type' and 'Year'. The data contains letters and numbers, and there are also some special characters, such as the underscore in NEG_M, which is 'Negligent Manslaughter'.
We will replace the full names later, but can someone suggest how I can separate
MURD11 --> MURD and 11 (in two columns)
NEG_M10 --> NEG_M and 10 (in two columns)
etc...
I have tried using:
df <- separate(totallong, Crime_Type, into = c("Crime", "Year"), sep = "[:digit:]", extra = "merge")
df <- separate(totallong, Crime_Type, into = c("Year", "Temp"), sep = "[:alpha:]", extra = "merge")
The first one separates the crime part, as it looks for numbers; the second one does not work at all.
I also tried:
df$Crime_Type <- apply(strsplit(as.character(df$Crime_Type), split="[:digit:]"))
That does not work at all either. I have gone through many posts on Stack Overflow, and that's where I got these commands, but I am now truly stuck and would appreciate your help.
Since you're using tidyr already (as evidenced by separate), try the extract function, which, given a regex, puts each captured group into a new column. The 'Crime_Type' is all the non-numeric stuff, and the 'Year' is the numeric stuff. Adjust the regex accordingly.
library(tidyr)
extract(df, 'Crime_Type', into=c('Crime', 'Year'), regex='^([^0-9]+)([0-9]+)$')
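As a quick sanity check (using a two-row sample like the totallong data frame defined at the end of the next answer), this gives:
library(tidyr)
totallong <- data.frame(Crime_Type = c('MURD11', 'NEG_M11'))
extract(totallong, 'Crime_Type', into = c('Crime', 'Year'), regex = '^([^0-9]+)([0-9]+)$')
#   Crime Year
# 1  MURD   11
# 2 NEG_M   11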
In base R, one option would be to create a delimiter between the non-numeric and numeric parts. We can capture the non-numeric ([^0-9]+) and numeric ([0-9]+) characters as groups by wrapping each in parentheses ((..)); in the replacement we use \\1 for the first capture group, followed by a , and then the second group (\\2). The result can be used as the input to read.table with sep=',' to read it as two columns:
df1 <- read.table(text=gsub('([^0-9]+)([0-9]+)', '\\1,\\2',
totallong$Crime_Type),sep=",", col.names=c('Crime', 'Year'))
df1
# Crime Year
#1 MURD 11
#2 NEG_M 11
If we need to, we can cbind it with the original dataset:
cbind(totallong, df1)
Or in base R, we can use strsplit with split specifying the boundary between a non-number ((?<=[^0-9])) and a number ((?=[0-9])); here we use lookarounds to match the boundary. The output will be a list, and we can rbind the list elements with do.call(rbind, ...) and convert the result to a data.frame:
as.data.frame(do.call(rbind, strsplit(as.character(totallong$Crime_Type),
split="(?<=[^0-9])(?=[0-9])", perl=TRUE)))
# V1 V2
#1 MURD 11
#2 NEG_M 11
Or another option is tstrsplit from the devel version of data.table, i.e. v1.9.5. Here too we use the same regex; in addition, there is an option to convert the output columns into different classes:
library(data.table)#v1.9.5+
setDT(totallong)[, c('Crime', 'Year') := tstrsplit(Crime_Type,
"(?<=[^0-9])(?=[0-9])", perl=TRUE, type.convert=TRUE)]
# Crime_Type Crime Year
#1: MURD11 MURD 11
#2: NEG_M11 NEG_M 11
If we don't need the 'Crime_Type' column in the output, it can be assigned to NULL
totallong[, Crime_Type:= NULL]
NOTE: Instructions to install the devel version are here
Or a faster option would be stri_extract_all from library(stringi), after collapsing the rows into a single string ('v2'). The alternating elements of 'v3' can then be extracted by indexing with seq to create the new data.frame:
library(stringi)
v2 <- paste(totallong$Crime_Type, collapse='')
v3 <- stri_extract_all(v2, regex='\\d+|\\D+')[[1]]
ind1 <- seq(1, length(v3), by=2)
ind2 <- seq(2, length(v3), by=2)
d1 <- data.frame(Crime=v3[ind1], Year= v3[ind2])
Benchmarks
v1 <- do.call(paste, c(expand.grid(c('MURD', 'NEG_M'), 11:15), sep=''))
set.seed(24)
test <- data.frame(v1= sample(v1, 40525992, replace=TRUE ))
system.time({
v2 <- paste(test$v1, collapse='')
v3 <- stri_extract_all(v2, regex='\\d+|\\D+')[[1]]
ind1 <- seq(1, length(v3), by=2)
ind2 <- seq(2, length(v3), by=2)
d1 <- data.frame(Crime=v3[ind1], Year= v3[ind2])
})
#user system elapsed
#56.019 1.709 57.838
data
totallong <- data.frame(Crime_Type= c('MURD11', 'NEG_M11'))
I have a very large table in an HDFStore of which I would like to select a subset using a query and then iterate over the subset chunk by chunk. I would like the query to take place before the selection is broken into chunks, so that all of the chunks are the same size.
The documentation here seems to indicate that this is the default behavior, but it is not very clear. However, it seems to me that the chunking actually takes place before the query, as shown in this example:
In [1]: pd.__version__
Out[1]: '0.13.0-299-gc9013b8'
In [2]: df = pd.DataFrame({'number': np.arange(1,11)})
In [3]: df
Out[3]:
number
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
[10 rows x 1 columns]
In [4]: with pd.get_store('test.h5') as store:
   ...:     store.append('df', df, data_columns=['number'])

In [5]: evens = [2, 4, 6, 8, 10]

In [6]: with pd.get_store('test.h5') as store:
   ...:     for chunk in store.select('df', 'number=evens', chunksize=5):
   ...:         print len(chunk)
2
3
I would expect only a single chunk of size 5 if the querying were happening before the result is broken into chunks, but this example gives two chunks of lengths 2 and 3.
Is this the intended behavior and if so is there an efficient workaround to give chunks of the same size without reading the table into memory?
I think when I wrote that, the intent was to use chunksize on the results of the query; I think it was changed as it was being implemented. The chunksize determines the sections that the query is applied to, and then you iterate over those. The problem is that you don't know a priori how many rows you are going to get.
However, there IS a way to do this. Here is the sketch: use select_as_coordinates to actually execute your query; this returns an Int64Index of the row numbers (the coordinates). Then apply an iterator to that, selecting based on those rows.
Something like this (this makes a nice recipe; I will include it in the docs, I think):
In [15]: def chunks(l, n):
   ....:     return [l[i:i+n] for i in xrange(0, len(l), n)]
   ....:

In [16]: with pd.get_store('test.h5') as store:
   ....:     coordinates = store.select_as_coordinates('df','number=evens')
   ....:     for c in chunks(coordinates, 2):
   ....:         print store.select('df',where=c)
   ....:
number
1 2
3 4
[2 rows x 1 columns]
number
5 6
7 8
[2 rows x 1 columns]
number
9 10
[1 rows x 1 columns]