R list function - list

I have a problem applying a function to list elements. I have a list called "mylist", which looks like:
[[1]] station global
1 2
1 2
1 2
1 14
1 38
1 169
[[2]] station global
2 2
2 2
2 23
2 86
In each list, I need to set values of "global" less than or equal to 2 to NA.
I have used
dat.list <- lapply(mylist, ``[[``, 'global')
to get only the global data.
Defining af function:
fct <- function(x) {
x[x <= 2] <- NA
}
and writing
lapply(dat.list, fct)
gives
[[1]] NA
[[2]] NA
What I would like to have is:
[[1]] station global
1 NA
1 NA
1 NA
1 14
1 38
1 169
[[2]] station global
2 NA
2 NA
2 23
2 86
I apprechiate any help or a point in the right direction, Regards Sisse

It would help if you posted a reproducible example. See here for advice on how to do this.
x will take on the element of the list. Since those appear to be data.frames, treat x as a data.frame:
fct <- function(x) {
x$global[x$global <= 2] <- NA
x
}

Related

Creating A Dataframe From A Text Dataset

I have a dataset that has hundreds of thousands of fields. The following is a simplified dataset
dataSet <- c("Plnt SLoc Material Description L.T MRP Stat Auto MatSG PC PN Freq Qty CFreq CQty Cur.RPt New.RPt CurRepl NewRepl Updt Cost ServStock Unit OpenMatResb DFStorLocLevel",
"0231 0002 GB.C152260-00001 ASSY PISTON & SEAL/O-RING 44 PD X A A A 18 136 30 29 50 43 24.88 51.000 EA",
"0231 0002 WH.112734 MOTOR REDUCER, THREE-PHAS 41 PD X B B A 16 17 3 3 5 4 483.87 1.000 EA X",
"0231 0002 WH.920569 SPINDLE MOTOR MINI O 22 PD X A A A 69 85 15 9 25 13 680.91 21.000 EA",
"0231 0002 GB.C150583-00001 VALVE-AIR MDI 64 PD X A A A 16 113 50 35 80 52 19.96 116.000 EA",
"0231 0002 FG.124-0140 BEARING 32 PD X A A A 36 205 35 32 50 48 21.16 55.000 EA",
"0231 0002 WP.254997 BEARING,BALL .9843 X 2.04 52 PD X A A A 18 155 50 39 100 58 2.69 181.000 EA"
)
I would like to create a dataframe out of this dataSet for further calculation. The approach I am following is as follows:
I split the dataSet by space and then recombine it.
dataSetSplit <- strsplit(dataSet, "\\s+")
The header (which is the first line) splits correctly and produces 25 characters. This can be seen by the str() function.
str(dataSetSplit)
I will then intend to combine all the rows together using the folloing script
combinedData <- data.frame(do.call(rbind, dataSetSplit))
Please note that the above script "combinedData " errors because the split did not produce equal number of fields.
For this approach to work all the fields must split correctly into 25 fields.
If you think this is a sound approach please let me know how to split the fileds into 25 fields.
It is worth mentioning that I do not like the approach of splitting the data set with the function strsplit(). It is an extremely time consuming step if used with a large data set. Can you please recommend an alternate approach to create a data frame out of the supplied data?
By the looks of it, you have a header row that is actually helpful. You can easily use gregexpr to calculate your "widths" to use with read.fwf.
Here's how:
## Use gregexpr to find the position of consecutive runs of spaces
## This will tell you the starting position of each column
Widths <- gregexpr("\\s+", dataSet[1])[[1]]
## `read.fwf` doesn't need the starting position, but the width of
## each column. We can use `diff` to calculate this.
Widths <- c(Widths[1], diff(Widths))
## Since there are no spaces after the last column, we need to calculate
## a reasonable width for that column too. We can do this with `nchar`
## to find the widest row in the data. From this, subtract the `sum`
## of all the previous values.
Widths <- c(Widths, max(nchar(dataSet)) - sum(Widths))
Let's also extract the column names. We could do this in read.fwf, but it would require us to substitute the spaces in the first line with a "sep" character.
Names <- scan(what = "", text = dataSet[1])
Now, read in everything except the first line. You would use the actual file instead of textConnection, I would suppose.
read.fwf(textConnection(dataSet), widths=Widths, strip.white = TRUE,
skip = 1, col.names = Names)
# Plnt SLoc Material Description L.T MRP Stat Auto MatSG PC PN Freq Qty
# 1 231 2 GB.C152260-00001 ASSY PISTON & SEAL/O-RING 44 PD NA X A A A 18 136
# 2 231 2 WH.112734 MOTOR REDUCER, THREE-PHAS 41 PD NA X B B A 16 17
# 3 231 2 WH.920569 SPINDLE MOTOR MINI O 22 PD NA X A A A 69 85
# 4 231 2 GB.C150583-00001 VALVE-AIR MDI 64 PD NA X A A A 16 113
# 5 231 2 FG.124-0140 BEARING 32 PD NA X A A A 36 205
# 6 231 2 WP.254997 BEARING,BALL .9843 X 2.04 52 PD NA X A A A 18 155
# CFreq CQty Cur.RPt New.RPt CurRepl NewRepl Updt Cost ServStock Unit OpenMatResb
# 1 NA NA 30 29 50 43 NA 24.88 51 EA <NA>
# 2 NA NA 3 3 5 4 NA 483.87 1 EA X
# 3 NA NA 15 9 25 13 NA 680.91 21 EA <NA>
# 4 NA NA 50 35 80 52 NA 19.96 116 EA <NA>
# 5 NA NA 35 32 50 48 NA 21.16 55 EA <NA>
# 6 NA NA 50 39 100 58 NA 2.69 181 EA <NA>
# DFStorLocLevel
# 1 NA
# 2 NA
# 3 NA
# 4 NA
# 5 NA
# 6 NA
Many thanks to Ananda Mahto, he provided many pieces to this answer.
widthMinusFirst <- diff(gregexpr('(\\s[A-Z])+', dataSet[1])[[1]])
widthFirst <- gregexpr('\\s+', dataSet[1])[[1]][1]
Width <- c(widthFirst, widthMinusFirst)
Widths <- c(Width, max(nchar(dataSet)) - sum(Width))
columnNames <- scan(what = "", text = dataSet[1])
read.fwf(textConnection(dataSet[-1]), widths = Widths, strip.white = FALSE,
skip = 0, col.names = columnNames)

R: fragment a list

I am kind of tired of working with lists..and my limited R capabilities ... I could not solve this from long time...
My list with multiple dataframe looks like the following:
set.seed(456)
sn1 = paste( "X", c(1:4), sep= "")
onelist <- list (df1 <- data.frame(sn = sn1, var1 = runif(4)),
df2 <- data.frame(sn = sn1, var1 = runif(4)),
df3 <- data.frame(sn = sn1,var1 = runif(4)))
[[1]]
sn var1
1 X1 0.3852362
2 X2 0.3729459
3 X3 0.2179086
4 X4 0.7551050
[[2]]
sn var1
1 X1 0.8216811
2 X2 0.5989182
3 X3 0.6510336
4 X4 0.8431172
[[3]]
sn var1
1 X1 0.4532381
2 X2 0.7167571
3 X3 0.2912222
4 X4 0.1798831
I want make a subset list in which the row 2 and 3 are only present.
srow <- c(2:3) # just I have many rows in real data
newlist <- lapply(onelist, function(y) subset(y, row(y) == srow))
The newlist is empty....
> newlist
[[1]]
[1] sn var1
<0 rows> (or 0-length row.names)
[[2]]
[1] sn var1
<0 rows> (or 0-length row.names)
[[3]]
[1] sn var1
<0 rows> (or 0-length row.names)
Help please ....
Does this do it?
Note the comma after the rows which implicitly is interpreted as NULL and results in the extraction all of the columns:
> lapply(onelist, "[", c(2,3),)
[[1]]
sn var1
2 X2 0.2105123
3 X3 0.7329553
[[2]]
sn var1
2 X2 0.33195997
3 X3 0.08243274
[[3]]
sn var1
2 X2 0.3852362
3 X3 0.3729459
You could have gotten your subset strategy to work with:
lapply(onelist, function(y) subset(y, rownames(y) %in% srow ))
Note that many time people use "==" when they really should be using %in%
?match
I don't think the row function does what you think it does:
Returns a matrix of integers indicating their row number in a matrix-like object, or a factor indicating the row labels.
Looking at what it returns on the list you have
> row(onelist[[1]])
[,1] [,2]
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 4
> row(onelist[[1]])==srow
[,1] [,2]
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] FALSE FALSE
You are doing a simple subset of the data.frames, so you can just use
newlist <- lapply(onelist, function(y) y[srow,])
which gives
> newlist
[[1]]
sn var1
2 X2 0.2105123
3 X3 0.7329553
[[2]]
sn var1
2 X2 0.33195997
3 X3 0.08243274
[[3]]
sn var1
2 X2 0.3852362
3 X3 0.3729459

Dataframes in a list; adding a new variable with name of dataframe

I have a list of dataframes which I eventually want to merge while maintaining a record of their original dataframe name or list index. This will allow me to subset etc across all the rows. To accomplish this I would like to add a new variable 'id' to every dataframe, which contains the name/index of the dataframe it belongs to.
Edit: "In my real code the dataframe variables are created from reading multiple files using the following code, so I don't have actual names only those in the 'files.to.read' list which I'm unsure if they will align with the dataframe order:
mylist <- llply(files.to.read, read.csv)
A few methods have been highlighted in several posts:
Working-with-dataframes-in-a-list-drop-variables-add-new-ones and
Using-lapply-with-changing-arguments
I have tried two similar methods, the first using the index list:
df1 <- data.frame(x=c(1:5),y=c(11:15))
df2 <- data.frame(x=c(1:5),y=c(11:15))
mylist <- list(df1,df2)
# Adds a new coloumn 'id' with a value of 5 to every row in every dataframe.
# I WANT to change the value based on the list index.
mylist1 <- lapply(mylist,
function(x){
x$id <- 5
return (x)
}
)
#Example of what I WANT, instead of '5'.
#> mylist1
#[[1]]
#x y id
#1 1 11 1
#2 2 12 1
#3 3 13 1
#4 4 14 1
#5 5 15 1
#
#[[2]]
#x y id
#1 1 11 2
#2 2 12 2
#3 3 13 2
#4 4 14 2
#5 5 15 2
The second attempts to pass the names() of the list.
# I WANT it to add a new coloumn 'id' with the name of the respective dataframe
# to every row in every dataframe.
mylist2 <- lapply(names(mylist),
function(x){
portfolio.results[[x]]$id <- "dataframe name here"
return (portfolio.results[[x]])
}
)
#Example of what I WANT, instead of 'dataframe name here'.
# mylist2
#[[1]]
#x y id
#1 1 11 df1
#2 2 12 df1
#3 3 13 df1
#4 4 14 df1
#5 5 15 df1
#
#[[2]]
#x y id
#1 1 11 df2
#2 2 12 df2
#3 3 13 df2
#4 4 14 df2
#5 5 15 df2
But the names() function doesn't work on a list of dataframes; it returns NULL.
Could I use seq_along(mylist) in the first example.
Any ideas or better way to handle the whole "merge with source id"
Edit - Added Solution below: I've implemented a solution using Hadleys suggestion and Tommy’s nudge which looks something like this.
files.to.read <- list.files(datafolder, pattern="\\_D.csv$", full.names=FALSE)
mylist <- llply(files.to.read, read.csv)
all <- do.call("rbind", mylist)
all$id <- rep(files.to.read, sapply(mylist, nrow))
I used the files.to.read vector as the id for each dataframe
I also changed from using merge_recurse() as it was very slow for some reason.
all <- merge_recurse(mylist)
Thanks everyone.
Personally, I think it's easier to add the names after collapse:
df1 <- data.frame(x=c(1:5),y=c(11:15))
df2 <- data.frame(x=c(1:5),y=c(11:15))
mylist <- list(df1 = df1, df2 = df2)
all <- do.call("rbind", mylist)
all$id <- rep(names(mylist), sapply(mylist, nrow))
Your first attempt was very close. By using indices instead of values it will work. Your second attempt failed because you didn't name the elements in your list.
Both solutions below use the fact that lapply can pass extra parameters (mylist) to the function.
df1 <- data.frame(x=c(1:5),y=c(11:15))
df2 <- data.frame(x=c(1:5),y=c(11:15))
mylist <- list(df1=df1,df2=df2) # Name each data.frame!
# names(mylist) <- c("df1", "df2") # Alternative way of naming...
# Use indices - and pass in mylist
mylist1 <- lapply(seq_along(mylist),
    function(i, x){
        x[[i]]$id <- i
        return (x[[i]])
    }, mylist
)
# Now the names work - but I pass in mylist instead of using portfolio.results.
mylist2 <- lapply(names(mylist),
function(n, x){
x[[n]]$id <- n
return (x[[n]])
}, mylist
)
names() could work it it had names, but you didn't give it any. It's an unnamed list. You will need ti use numeric indices:
> for(i in 1:length(mylist) ){ mylist[[i]] <- cbind(mylist[[i]], id=rep(i, nrow(mylist[[i]]) ) ) }
> mylist
[[1]]
x y id
1 1 11 1
2 2 12 1
3 3 13 1
4 4 14 1
5 5 15 1
[[2]]
x y id
1 1 11 2
2 2 12 2
3 3 13 2
4 4 14 2
5 5 15 2
dlply function form plyr package could be an answer:
library('plyr')
df1 <- data.frame(x=c(1:5),y=c(11:15))
df2 <- data.frame(x=c(1:5),y=c(11:15))
mylist <- list(df1 = df1, df2 = df2)
all <- ldply(mylist)
You could also use tidyverse, using lst instead of list which automatically names the list for you and then use imap:
library(tidyverse)
mylist <- dplyr::lst(df1, df2)
purrr::imap(mylist, ~mutate(.x, id = .y))
# $df1
# x y id
# 1 1 11 df1
# 2 2 12 df1
# 3 3 13 df1
# 4 4 14 df1
# 5 5 15 df1
# $df2
# x y id
# 1 1 11 df2
# 2 2 12 df2
# 3 3 13 df2
# 4 4 14 df2
# 5 5 15 df2

Working with dataframes in a list: Rename variables

Define:
dats <- list( df1 = data.frame(A=sample(1:3), B = sample(11:13)),
df2 = data.frame(AA=sample(1:3), BB = sample(11:13)))
s.t.
> dats
$df1
A B
1 2 12
2 3 11
3 1 13
$df2
AA BB
1 1 13
2 2 12
3 3 11
I would like to change all variable names from all caps to lower. I can do this with a loop but somehow cannot get this lapply call to work:
dats <- lapply(dats, function(x)
names(x)<-tolower(names(x)))
which results in:
> dats
$df1
[1] "a" "b"
$df2
[1] "aa" "bb"
while the desired result is:
> dats
$df1
a b
1 2 12
2 3 11
3 1 13
$df2
aa bb
1 1 13
2 2 12
3 3 11
If you don't use return at the end of a function, the last evaluated expression returned. So you need to return x.
dats <- lapply(dats, function(x) {
names(x)<-tolower(names(x))
x})

Combining (cbind) vectors of different length

I have several vectors of unequal length and I would like to cbind them. I've put the vectors into a list and I have tried to combine the using do.call(cbind, ...):
nm <- list(1:8, 3:8, 1:5)
do.call(cbind, nm)
# [,1] [,2] [,3]
# [1,] 1 3 1
# [2,] 2 4 2
# [3,] 3 5 3
# [4,] 4 6 4
# [5,] 5 7 5
# [6,] 6 8 1
# [7,] 7 3 2
# [8,] 8 4 3
# Warning message:
# In (function (..., deparse.level = 1) :
# number of rows of result is not a multiple of vector length (arg 2)
As expected, the number of rows in the resulting matrix is the length of the longest vector, and the values of the shorter vectors are recycled to make up for the length.
Instead I'd like to pad the shorter vectors with NA values to obtain the same length as the longest vector. I'd like the matrix to look like this:
# [,1] [,2] [,3]
# [1,] 1 3 1
# [2,] 2 4 2
# [3,] 3 5 3
# [4,] 4 6 4
# [5,] 5 7 5
# [6,] 6 8 NA
# [7,] 7 NA NA
# [8,] 8 NA NA
How can I go about doing this?
You can use indexing, if you index a number beyond the size of the object it returns NA. This works for any arbitrary number of rows defined with foo:
nm <- list(1:8,3:8,1:5)
foo <- 8
sapply(nm, '[', 1:foo)
EDIT:
Or in one line using the largest vector as number of rows:
sapply(nm, '[', seq(max(sapply(nm,length))))
From R 3.2.0 you may use lengths ("get the length of each element of a list") instead of sapply(nm, length):
sapply(nm, '[', seq(max(lengths(nm))))
You should fill vectors with NA before calling do.call.
nm <- list(1:8,3:8,1:5)
max_length <- max(unlist(lapply(nm,length)))
nm_filled <- lapply(nm,function(x) {ans <- rep(NA,length=max_length);
ans[1:length(x)]<- x;
return(ans)})
do.call(cbind,nm_filled)
This is a shorter version of Wojciech's solution.
nm <- list(1:8,3:8,1:5)
max_length <- max(sapply(nm,length))
sapply(nm, function(x){
c(x, rep(NA, max_length - length(x)))
})
Here is an option using stri_list2matrix from stringi
library(stringi)
out <- stri_list2matrix(nm)
class(out) <- 'numeric'
out
# [,1] [,2] [,3]
#[1,] 1 3 1
#[2,] 2 4 2
#[3,] 3 5 3
#[4,] 4 6 4
#[5,] 5 7 5
#[6,] 6 8 NA
#[7,] 7 NA NA
#[8,] 8 NA NA
Late to the party but you could use cbind.fill from rowr package with fill = NA
library(rowr)
do.call(cbind.fill, c(nm, fill = NA))
# object object object
#1 1 3 1
#2 2 4 2
#3 3 5 3
#4 4 6 4
#5 5 7 5
#6 6 8 NA
#7 7 NA NA
#8 8 NA NA
If you have a named list instead and want to maintain the headers you could use setNames
nm <- list(a = 1:8, b = 3:8, c = 1:5)
setNames(do.call(cbind.fill, c(nm, fill = NA)), names(nm))
# a b c
#1 1 3 1
#2 2 4 2
#3 3 5 3
#4 4 6 4
#5 5 7 5
#6 6 8 NA
#7 7 NA NA
#8 8 NA NA