Applying predicates on a list in R

Given a list of values in R, what is a nice way to filter values in a list by a given predicate function?

It's not entirely clear whether you have a proper list object in R, or another type of object such as a data.frame or vector. Assuming you have a true list object, we can combine lapply and subset to do what you want. If you don't have a list, then there's no need for lapply.
set.seed(1)
# Fake data: a named list of two data frames
dat <- list(
  a = data.frame(x = sample(1:10, 20, TRUE)),
  b = data.frame(x = sample(1:10, 20, TRUE))
)
#Apply the subset function over the list
lapply(dat, subset, x < 3)
$a
x
10 1
12 2
$b
x
4 2
7 1
14 2
18 2
#Example two
lapply(dat, subset, x %in% c(1,7,9))
$a
x
6 9
8 7
9 7
10 1
13 7
$b
x
3 7
7 1
9 9
15 9
16 7

Related

sort dataframe columns in R

Is there a way to sort data frame columns in R? I tried the code below, but the result is returned as a character vector instead of a data frame.
> asd <- data.frame(a = c("fsd","sdfsd"))
> asd <- with(asd, asd[order(a) , ])
> asd
[1] "fsd" "sdfsd"
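Can the result be kept as a data frame?
Try this. Indexing with order() keeps the result as a data frame when there is more than one column; for a single-column data frame like yours, add drop = FALSE, e.g. asd[order(asd$a), , drop = FALSE], so the result is not simplified to a vector: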
a <- data.frame(x=LETTERS[1:5],y=c(5:1))
a[order(a$x),]
a[order(a$y),]
> a[order(a$x),]
x y
1 A 5
2 B 4
3 C 3
4 D 2
5 E 1
> a[order(a$y),]
x y
5 E 1
4 D 2
3 C 3
2 B 4
1 A 5

How can I save results in different columns of a dataframe in Python?

I have to compare a column with all the other columns in a dataframe. The column I have to compare against is in position 4, so I write df.iloc[x,4] to get its values. I then have to multiply these values by the values in the next column (for example df.iloc[x,5]), create a new column in the dataframe, and save the results there. I have to repeat this procedure through the last existing column (the original dataframe has 43 columns, so the last one is df.iloc[x,43]).
How can I do this in Python?
If possible, could you give some examples? I tried to put my code in the post, but I'm not good with my new phone.
I think you can use eq to compare the columns after position 4 with column E, which is in position 4:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,8,9],
'G':[1,3,5],
'H':[5,3,6],
'I':[7,4,3]})
print (df)
A B C D E F G H I
0 1 4 7 1 5 7 1 5 7
1 2 5 8 3 3 8 3 3 4
2 3 6 9 5 6 9 5 6 3
print (df.iloc[:,5:].eq(df.iloc[:,4], axis=0))
F G H I
0 False False True False
1 False True True False
2 False False True False
If you need to multiply by the column in position 4, use mul:
print (df.iloc[:,5:].mul(df.iloc[:,4], axis=0))
F G H I
0 35 5 25 35
1 24 9 9 12
2 54 30 36 18
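The question also asks for the products to be stored as new columns. Here is a minimal sketch of one way to do that, assuming the sample frame above. The E_x_ prefix and the names base, products and result are hypothetical choices for illustration, not something from the original post:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9],
                   'D': [1, 3, 5],
                   'E': [5, 3, 6],
                   'F': [7, 8, 9],
                   'G': [1, 3, 5],
                   'H': [5, 3, 6],
                   'I': [7, 4, 3]})

# Column in position 4 (E) is the one every later column is multiplied by.
base = df.iloc[:, 4]

# Multiply every column after position 4 by E, row by row.
products = df.iloc[:, 5:].mul(base, axis=0)

# Rename the product columns (hypothetical naming scheme) so they do not
# clash with the originals, then attach them to the original frame.
products = products.add_prefix(f"{base.name}_x_")
result = df.join(products)

print(result)
```

Because join aligns on the index, the new columns line up with the original rows; with 43 columns the same slice df.iloc[:, 5:] covers everything after position 4.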
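Note that multiplying one DataFrame by another aligns on column labels rather than on positions, so the following multiplies each shared column by itself (E, which has no partner, is filled with 1):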
print (df.iloc[:,4:].mul(df.iloc[:,5:], axis=0, fill_value=1))
E F G H I
0 5.0 49 1 25 49
1 3.0 64 9 9 16
2 6.0 81 25 36 9
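The output above shows squares (F*F, G*G, ...) because DataFrame.mul matches columns by label. If the goal is to multiply each column by its right-hand neighbour, one option is to drop the labels from one side so the multiplication is positional. This is a sketch under that assumption; the names left, right and shifted and the column naming are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9],
                   'D': [1, 3, 5],
                   'E': [5, 3, 6],
                   'F': [7, 8, 9],
                   'G': [1, 3, 5],
                   'H': [5, 3, 6],
                   'I': [7, 4, 3]})

left = df.iloc[:, 4:-1]   # columns E, F, G, H
right = df.iloc[:, 5:]    # columns F, G, H, I (each one's right-hand neighbour)

# Converting one side to a NumPy array removes its labels, so the
# element-wise product pairs columns by position instead of by name.
shifted = left * right.to_numpy()

# Label each product with the pair of columns it came from.
shifted.columns = [f"{a}_x_{b}" for a, b in zip(left.columns, right.columns)]

print(shifted)
```

The two slices must have the same shape for this to work, which is why left stops one column before the end.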

Modify certain cells depending on given conditions using pandas

Assume I have a dataframe that looks like the one below.
import pandas as pd
df = pd.DataFrame({
'name' : ['1st', '2nd', '3rd'],
'john_01' : [1, 2, 3],
'mary_02' : [4,5,6],
'peter_03' : [7, 8, 9],
'roger_04' : [10,11, 12],
'ken_05' : [13, 14, 15],
})
df2 = df.set_index('name')
john_01 ken_05 mary_02 peter_03 roger_04
name
1st 1 13 4 7 10
2nd 2 14 5 8 11
3rd 3 15 6 9 12
Modify_List_col = ['mary_02','peter_03']
Modify_List_row = ['2nd'] # use tolist() to get this list from additional files
I only want to modify the cells selected by Modify_List_col and Modify_List_row, so that I get something like the table below, where those cells are replaced by 'X'.
john_01 ken_05 mary_02 peter_03 roger_04
name
1st 1 13 4 7 10
2nd 2 14 X X 11
3rd 3 15 6 9 12
Does anyone know how to get the results in one line using pandas please?
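You can use the loc indexer (NumPy is imported directly here, because the pd.np alias is no longer available in recent pandas versions):
In[24]: import numpy as np
In[25]: df = pd.DataFrame(np.arange(25).reshape(5,5)).set_index(0)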
In[26]: df
Out[26]:
1 2 3 4
0
0 1 2 3 4
5 6 7 8 9
10 11 12 13 14
15 16 17 18 19
20 21 22 23 24
In[27]: df.loc[[10,15],[2,3,4]] = "x"
In[28]: df
Out[28]:
1 2 3 4
0
0 1 2 3 4
5 6 7 8 9
10 11 x x x
15 16 x x x
20 21 22 23 24
To do that, just set the column 0 as index, then select the portion of the dataframe with loc and assign the value "x".
It works the same way for your dataset, using the frame indexed by name (df2):
In[51]: Modify_List_col = ['mary_02', 'peter_03']
Modify_List_row = ['2nd']
df2.loc[Modify_List_row, Modify_List_col] = "X"
In[52]: df2
Out[52]:
john_01 ken_05 mary_02 peter_03 roger_04
name
1st 1 13 4 7 10
2nd 2 14 X X 11
3rd 3 15 6 9 12
I hope this can help you.
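One caveat: writing a string into integer columns changes what the column can hold, and recent pandas versions warn about (or may reject) assigning a value that does not fit the column's dtype. A hedged sketch of the same loc assignment on the asker's data, casting the target columns to object first so they can hold both numbers and 'X':

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['1st', '2nd', '3rd'],
    'john_01': [1, 2, 3],
    'mary_02': [4, 5, 6],
    'peter_03': [7, 8, 9],
    'roger_04': [10, 11, 12],
    'ken_05': [13, 14, 15],
})
df2 = df.set_index('name')

Modify_List_col = ['mary_02', 'peter_03']
Modify_List_row = ['2nd']

# Cast only the columns that will receive strings to object dtype,
# so they can hold a mix of integers and 'X'.
df2 = df2.astype({col: object for col in Modify_List_col})

# Label-based selection of rows and columns, then a single assignment.
df2.loc[Modify_List_row, Modify_List_col] = 'X'

print(df2)
```

The cast is an extra line, but the replacement itself is still the single loc assignment.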

Dataframes in a list; adding a new variable with name of dataframe

I have a list of dataframes which I eventually want to merge while maintaining a record of their original dataframe name or list index. This will allow me to subset etc across all the rows. To accomplish this I would like to add a new variable 'id' to every dataframe, which contains the name/index of the dataframe it belongs to.
Edit: In my real code the dataframe variables are created by reading multiple files using the following code, so I don't have actual names, only those in the 'files.to.read' list, and I'm unsure whether they will align with the dataframe order:
mylist <- llply(files.to.read, read.csv)
A few methods have been highlighted in several posts:
Working-with-dataframes-in-a-list-drop-variables-add-new-ones and
Using-lapply-with-changing-arguments
I have tried two similar methods, the first using the index list:
df1 <- data.frame(x=c(1:5),y=c(11:15))
df2 <- data.frame(x=c(1:5),y=c(11:15))
mylist <- list(df1,df2)
# Adds a new column 'id' with a value of 5 to every row in every dataframe.
# I WANT to change the value based on the list index.
mylist1 <- lapply(mylist,
function(x){
x$id <- 5
return (x)
}
)
#Example of what I WANT, instead of '5'.
#> mylist1
#[[1]]
#x y id
#1 1 11 1
#2 2 12 1
#3 3 13 1
#4 4 14 1
#5 5 15 1
#
#[[2]]
#x y id
#1 1 11 2
#2 2 12 2
#3 3 13 2
#4 4 14 2
#5 5 15 2
The second attempts to pass the names() of the list.
# I WANT it to add a new column 'id' with the name of the respective dataframe
# to every row in every dataframe.
mylist2 <- lapply(names(mylist),
function(x){
portfolio.results[[x]]$id <- "dataframe name here"
return (portfolio.results[[x]])
}
)
#Example of what I WANT, instead of 'dataframe name here'.
# mylist2
#[[1]]
#x y id
#1 1 11 df1
#2 2 12 df1
#3 3 13 df1
#4 4 14 df1
#5 5 15 df1
#
#[[2]]
#x y id
#1 1 11 df2
#2 2 12 df2
#3 3 13 df2
#4 4 14 df2
#5 5 15 df2
But the names() function doesn't work on a list of dataframes; it returns NULL.
Could I use seq_along(mylist) in the first example?
Any ideas, or a better way to handle the whole "merge with source id"?
Edit - Added Solution below: I've implemented a solution using Hadley's suggestion and Tommy's nudge, which looks something like this.
files.to.read <- list.files(datafolder, pattern="\\_D.csv$", full.names=FALSE)
mylist <- llply(files.to.read, read.csv)
all <- do.call("rbind", mylist)
all$id <- rep(files.to.read, sapply(mylist, nrow))
I used the files.to.read vector as the id for each dataframe
I also stopped using merge_recurse(), which I had used before, as it was very slow for some reason:
all <- merge_recurse(mylist)
Thanks everyone.
Personally, I think it's easier to add the names after collapse:
df1 <- data.frame(x=c(1:5),y=c(11:15))
df2 <- data.frame(x=c(1:5),y=c(11:15))
mylist <- list(df1 = df1, df2 = df2)
all <- do.call("rbind", mylist)
all$id <- rep(names(mylist), sapply(mylist, nrow))
Your first attempt was very close. By using indices instead of values it will work. Your second attempt failed because you didn't name the elements in your list.
Both solutions below use the fact that lapply can pass extra parameters (mylist) to the function.
df1 <- data.frame(x=c(1:5),y=c(11:15))
df2 <- data.frame(x=c(1:5),y=c(11:15))
mylist <- list(df1=df1,df2=df2) # Name each data.frame!
# names(mylist) <- c("df1", "df2") # Alternative way of naming...
# Use indices - and pass in mylist
mylist1 <- lapply(seq_along(mylist),
    function(i, x){
        x[[i]]$id <- i
        return (x[[i]])
    }, mylist
)
# Now the names work - but I pass in mylist instead of using portfolio.results.
mylist2 <- lapply(names(mylist),
    function(n, x){
        x[[n]]$id <- n
        return (x[[n]])
    }, mylist
)
names() could work if the list had names, but you didn't give it any. It's an unnamed list. You will need to use numeric indices:
> for(i in 1:length(mylist) ){ mylist[[i]] <- cbind(mylist[[i]], id=rep(i, nrow(mylist[[i]]) ) ) }
> mylist
[[1]]
x y id
1 1 11 1
2 2 12 1
3 3 13 1
4 4 14 1
5 5 15 1
[[2]]
x y id
1 1 11 2
2 2 12 2
3 3 13 2
4 4 14 2
5 5 15 2
The ldply function from the plyr package could be an answer:
library('plyr')
df1 <- data.frame(x=c(1:5),y=c(11:15))
df2 <- data.frame(x=c(1:5),y=c(11:15))
mylist <- list(df1 = df1, df2 = df2)
all <- ldply(mylist)
You could also use the tidyverse: create the list with lst instead of list, which automatically names the elements for you, and then use imap:
library(tidyverse)
mylist <- dplyr::lst(df1, df2)
purrr::imap(mylist, ~mutate(.x, id = .y))
# $df1
# x y id
# 1 1 11 df1
# 2 2 12 df1
# 3 3 13 df1
# 4 4 14 df1
# 5 5 15 df1
# $df2
# x y id
# 1 1 11 df2
# 2 2 12 df2
# 3 3 13 df2
# 4 4 14 df2
# 5 5 15 df2

Working with dataframes in a list: Rename variables

Define:
dats <- list( df1 = data.frame(A=sample(1:3), B = sample(11:13)),
df2 = data.frame(AA=sample(1:3), BB = sample(11:13)))
so that
> dats
$df1
A B
1 2 12
2 3 11
3 1 13
$df2
AA BB
1 1 13
2 2 12
3 3 11
I would like to change all variable names from upper case to lower case. I can do this with a loop, but somehow I cannot get this lapply call to work:
dats <- lapply(dats, function(x)
names(x)<-tolower(names(x)))
which results in:
> dats
$df1
[1] "a" "b"
$df2
[1] "aa" "bb"
while the desired result is:
> dats
$df1
a b
1 2 12
2 3 11
3 1 13
$df2
aa bb
1 1 13
2 2 12
3 3 11
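If you don't use return at the end of a function, the value of the last evaluated expression is returned. In your function that is the assignment, whose value is the character vector of new names, so you need to return x.
dats <- lapply(dats, function(x) {
  names(x) <- tolower(names(x))
  x  # return the modified data frame, not the names
})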