How to rename a column of a data frame with part of the data frame identifier in R? - regex

I've got a number of files that contain gene expression data. In each file, the gene name is kept in a column "Gene_symbol" and the expression measure (a real number) is kept in a column "RPKM". The file name consists of an identifier followed by "_" and the rest of the name (it ends with "expression.txt"). I would like to load all of these files into R as data frames, rename each data frame's "RPKM" column to the identifier of the original file, and then join the data frames by "Gene_symbol" into one large data frame: a single "Gene_symbol" column followed by all the expression-measure columns from the individual files, each labeled with the original identifier.
I've managed to transfer the identifier of the original files to the names of the individual data frames as follows.
files <- list.files(pattern = "expression.txt$")
for (i in files) {
  var_name <- paste("Data", strsplit(i, "_")[[1]][1], sep = "_")
  assign(var_name, read.table(i, header = TRUE)[, c("Gene_symbol", "RPKM")])
}
So now I'm at a stage where I have dataframes as follows:
Data_id0001 <- data.frame(Gene_symbol=c("geneA","geneB","geneC"),RPKM=c(2.43,5.24,6.53))
Data_id0002 <- data.frame(Gene_symbol=c("geneA","geneB","geneC"),RPKM=c(4.53,1.07,2.44))
But then I don't seem to be able to rename the RPKM column with the id000x bit. (That is in a fully automated way of course, looping through all the data frames I will generate in the real scenario.)
I've tried to store the identifier bit as a comment with the data frames but seem to be unable to assign the comment from within a loop.
Any help would be appreciated,
mce

You should never work this way in R. Always try to keep all your data frames in a list and operate on them with functions such as lapply. Thus, instead of using assign, just create an empty list of the same length as your files list and fill it within the for loop.
For your current situation, we can fix it using a combination of ls and mget to pull these data frames from the global environment into a list, and then change the columns of interest.
temp <- mget(ls(pattern = "Data_id\\d+$"))
lapply(names(temp), function(x) names(temp[[x]])[2] <<- gsub("Data_", "", x))
temp
#$Data_id0001
# Gene_symbol id0001
# 1 geneA 2.43
# 2 geneB 5.24
# 3 geneC 6.53
#
# $Data_id0002
# Gene_symbol id0002
# 1 geneA 4.53
# 2 geneB 1.07
# 3 geneC 2.44
You could eventually use list2env to get them back into the global environment, but you should use it with caution.

thanks a lot for your suggestions! I think I get the point. The way I'm doing it now (see below) is hopefully a lot more R-like and works fine!!!
Cheers,
Maik
library(plyr)
files <- list.files(pattern = "expression.txt$")
temp <- list()
for (i in 1:length(files)) {
  temp[[i]] <- read.table(files[i], header = TRUE)[, c("Gene_symbol", "RPKM")]
}
for (i in 1:length(temp)) {
  temp[[i]] <- rename(temp[[i]], c("RPKM" = strsplit(files[i], "_")[[1]][1]))
}
combined_expression <- join_all(temp, by="Gene_symbol", type="full")

Related

pandas 'outer' merge of multiple csvs using too much memory

I am new to coding and have a lot of big data to deal with. Currently I am trying to merge 26 tsv files (each has two columns without a header: one is a contig_number, the other is a count).
If a tsv does not have a count for a particular contig_number, it lacks that row, so I am attempting to use how = 'outer' and fill in the missing values with 0 afterwards.
I have been successful for the tsvs which I have subsetted to run the initial tests, but when I run the script on the actual data, which is large (~40,000 rows, two columns), more and more memory is used...
I got to 500Gb of RAM on the server and called it a day.
This is the code that is successful on the subsetted tsvs:
import glob
import logging
from functools import reduce

import pandas as pd

files = glob.glob('*_count.tsv')
data_frames = []
logging.info("Reading in sample files and adding to list")
for fp in files:
    # read in the files and put them into dataframes
    df = pd.read_csv(fp, sep='\t', header=None, index_col=0)
    # rename the columns so we know what file they came from
    df = df.rename(columns={1: str(fp)}).reset_index()
    df = df.rename(columns={0: "contig"})
    # append the dataframes to a list
    data_frames.append(df)
logging.info("Merging the tables on contig, and filling in samples with no counts for contigs")
# merge the tables on contig with how='outer', which keeps all rows but leaves NaN where there is no data
df = reduce(lambda left, right: pd.merge(left, right, how='outer', on="contig"), data_frames)
# this bit is important to fill missing data with a 0
df.fillna(0, inplace=True)
logging.info("Writing concatenated count table to file")
# write the dataframe to file
df.to_csv("combined_bamm_filter_count_file.tsv",
          sep='\t', index=False, header=True)
I would appreciate any advice or suggestions! Maybe there is just too much to hold in memory, and I should be trying something else.
Thank you!
I usually do these types of operations with pd.concat. I don't know the exact details of why it's more efficient, but pandas has some optimizations for combining indices.
I would do
data_frames = []
for fp in files:
    # read in the files and put them into dataframes
    df = pd.read_csv(fp, sep='\t', header=None, index_col=0)
    # rename the columns so we know what file they came from
    df = df.rename(columns={1: str(fp)})
    # just keep the contig as the index
    data_frames.append(df)
df_full = pd.concat(data_frames, axis=1)
and then df_full=df_full.fillna(0) if you want to.
In fact, since each of your files has only one column (plus an index), you may do better yet by treating them as Series instead of DataFrames.
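To make the Series idea concrete, here is a minimal sketch. The file names and contents below are made-up stand-ins (fed through io.StringIO rather than real files on disk):

```python
import io
import pandas as pd

# stand-ins for two on-disk tsv files: contig -> count, no header
file_contents = {
    "sampleA_count.tsv": "contig1\t5\ncontig2\t3\n",
    "sampleB_count.tsv": "contig2\t7\ncontig3\t1\n",
}

series_list = []
for name, text in file_contents.items():
    # squeeze each two-column file down to a Series indexed by contig
    s = pd.read_csv(io.StringIO(text), sep="\t", header=None, index_col=0).iloc[:, 0]
    s.name = name
    s.index.name = "contig"
    series_list.append(s)

# concat aligns on the shared contig index (an outer join by default),
# then missing counts become 0
df_full = pd.concat(series_list, axis=1).fillna(0)
print(df_full)
```

Because each Series carries only its values and index, pandas can align everything in one pass instead of re-merging pairwise, which is where the repeated-merge approach burns memory.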

How to combine two for loops working with netcdf lists?

I have problems combining multiple for loops. I will give an example with two of them that I would like to combine. If I know how to do it with two, I will also be able to do it with multiple loops.
If anyone knows how to write this as lapply function that would also be nice.
require(ncdf4)
#### download files from this link to a directory (I just downloaded manually; two files are sufficient for the example):
#### ftp://rfdata:forceDATA#ftp.iiasa.ac.at/WFDEI/LWdown_daily_WFDEI/
setwd("C:/place_where_I_have_downloaded_my_files_from_link/")
temp = list.files(pattern="*.nc") #list imported netcdf files
list2env(
  lapply(setNames(temp, make.names(gsub("*.nc$", "", temp))), nc_open),
  envir = .GlobalEnv) # import all netcdf file handles into the global environment
#### first loop - # select parameter out of netcdf files and combine into a List of 2
list_temp <- list() # create empty list before loop
for (t in temp[1:2]) {
  list_temp[t] <- list(data.frame(LWdown = ncvar_get(nc_open(t), "LWdown")[428, 176, ], xcoor = 176, ycoor = 428))
}
LW_bind <- do.call(rbind, list_temp)
rownames(LW_bind) <- NULL
#### second loop # select parameter out of onenetcdf file per x-coordinate and combine into a List of 2
list_temp <- list() # create empty list before loop
for (x in 176:177) {
  list_temp[as.character(x)] <- list(data.frame(LWdown = ncvar_get(nc_open(temp[1]), "LWdown")[428, x, ], xcoor = x, ycoor = 428))
}
LW_bind <- do.call(rbind, list_temp)
rownames(LW_bind) <- NULL
How I tried to combine but didn't work:
#### combined loops
list_temp<-list()
for (t in temp[1:2]) {
  for (x in 176:177) {
    # ncin <- list()
    ncin <- nc_open(t)
    list_temp[x][t] <- list(data.frame(LWdown = ncvar_get(ncin, "LWdown")[428, x, ], x = x, y = 428))
  }
}
LWdown_1to2<-do.call(rbind,list_temp)
rownames(LWdown_1to2)<-NULL
I already solved my problem; see below. But I am still curious how one could combine the two for loops as described above, so I will leave the question open and unanswered.
Here is my solution:
require(arrayhelpers);require(stringr);require(plyr);require(ncdf4)
# store all files from ftp://rfdata:forceDATA#ftp.iiasa.ac.at/WFDEI/ in the following folder:
setwd("C:/folder")
temp = list.files(pattern="*.nc") #list all the file names
param<-gsub("_\\S+","",temp,perl=T) #extract parameter from file name
xcoord=seq(176,180,by=1) #The X-coordinates you are interested in
ycoord=seq(428,433,by=1) #The Y-coordinates you are interested in
list_var<-list() # make an empty list
for (t in 1:length(temp)) {
  temp_year <- str_sub(temp, -9, -6) # characters from 9th- to 6th-to-last of each file name: the year
  temp_month <- str_sub(temp, -5, -4) # characters from 5th- to 4th-to-last of each file name: the month
  temp_netcdf <- nc_open(temp[t])
  temp_day <- rep(seq(1:length(ncvar_get(temp_netcdf, "day"))), length(xcoord)*length(ycoord)) # make a vector of day numbers the same length as the number of values
  dim.order <- sapply(temp_netcdf[["var"]][[param[t]]][["dim"]], function(x) x$name) # gives the name of each level of the array
  start <- c(lon = 428, lat = 176, tstep = 1) # the starting value of each variable
  count <- c(lon = 6, lat = 5, tstep = length(ncvar_get(temp_netcdf, "day"))) # how many values of each variable to read, starting from start
  tempstore <- ncvar_get(temp_netcdf, param[t], start = start[dim.order], count = count[dim.order]) # array with parameter values
  df_temp <- array2df(tempstore, levels = list(lon = ycoord, lat = xcoord, day = NA), label.x = "value") # convert array to data frame
  Add_date <- sort(as.Date(paste(temp_year[t], "-", temp_month[t], "-", temp_day, sep = ""), "%Y-%m-%d"), decreasing = FALSE) # make a vector with the dates
  list_var[t] <- list(data.frame(Add_date, df_temp, parameter = param[t])) # add dates to the data frame and store in a list of all output files
  ### nc_close(temp_netcdf) # close nc file to prevent data loss and errors
}
All_NetCDF_var_in1df<-do.call(rbind,list_var)

How do I iterate a loop over several data frames in a list in python

I am very new to programming and am working with Python. For a work project I am trying to read several .csv files, convert them to data frames, concatenate some of the fields into one for a column header, and then append all of the dataframes into one big DataFrame. I have searched extensively in StackOverflow as well as in other resources but I have not been able to find an answer. Here is the code I have thus far along with some abbreviated output:
import pandas as pd
import glob
# Read a directory of files to a list
csvlist = []
for f in glob.glob("AssayCerts/*"):
    csvlist.append(f)
csvlist
['AssayCerts/CH09051590.csv', 'AssayCerts/CH09051591.csv', 'AssayCerts/CH14158806.csv', 'AssayCerts/CH14162453.csv', 'AssayCerts/CH14186004.csv']
# Read .csv files and convert to DataFrames
dflist = []
for csv in csvlist:
    df = pd.read_csv(filename, header=None, skiprows=7)
    dflist.append(df)
dflist
[ 0 1 2 3 4 5 \
0 NaN Au-AA23 ME-ICP41 ME-ICP41 ME-ICP41 ME-ICP41
1 SAMPLE Au Ag Al As B
2 DESCRIPTION ppm ppm % ppm ppm
# concatenates the cells in the first three rows of the last dataframe; need to apply this to all of the dataframes.
for df in dflist:
    column_names = df.apply(lambda x: str(x[1]) + '-' + str(x[2]) + ' - ' + str(x[0]), axis=0)
column_names
0 SAMPLE-DESCRIPTION - nan
1 Au-ppm - Au-AA23
2 Ag-ppm - ME-ICP41
3 Al-% - ME-ICP41
I am unable to apply the last operation across all of the DataFrames. It seems I can only get it to apply to the last DataFrame in my list. Once I get past this point I will have to append all of the DataFrames to form one large DataFrame.
As Andy Hayden mentions in his comment, the reason your loop only appears to work on the last DataFrame is that you just keep assigning the result of df.apply( ... ) to column_names, which gets written over each time. So at the end of the loop, column_names always contains the results from the last DataFrame in the list.
But you also have some other problems in your code. In the loop that begins for csv in csvlist:, you never actually reference csv - you just reference filename, which doesn't appear to be defined. And dflist just appears to have one DataFrame in it anyway.
As written in your problem, the code doesn't appear to work. I'd advise posting the real code that you're using, and only what's relevant to your problem (i.e. if building csvlist is working for you, then you don't need to show it to us).
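For what it's worth, here is a minimal sketch of both fixes together: reading from the loop variable (rather than the undefined filename), and collecting the computed names for every DataFrame instead of overwriting one variable. The file contents below are hypothetical stand-ins fed through io.StringIO, and the column-name rule is simplified from the original:

```python
import io
import pandas as pd

# hypothetical stand-ins for the csv files
# (the real code would loop over glob.glob("AssayCerts/*"))
fake_files = {
    "CH09051590.csv": "SAMPLE,Au-AA23\nDESCRIPTION,ppm\nS1,0.5\n",
    "CH09051591.csv": "SAMPLE,Au-AA23\nDESCRIPTION,ppm\nS2,0.7\n",
}

dflist = []
for csv in fake_files:  # use the loop variable, not an unrelated `filename`
    df = pd.read_csv(io.StringIO(fake_files[csv]), header=None)
    # build combined column names from the first two rows, then drop those rows
    df.columns = [str(a) + "-" + str(b) for a, b in zip(df.iloc[0], df.iloc[1])]
    df = df.iloc[2:].reset_index(drop=True)
    dflist.append(df)  # collect per-file results instead of overwriting one variable

# append all of the DataFrames to form one large DataFrame
big = pd.concat(dflist, ignore_index=True)
print(big)
```

The key change is that each per-file result is appended to a list inside the loop, so nothing is lost between iterations, and the final concat is done once at the end.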

read table with spaces in one column

I am attempting to extract tables from very large text files (computer logs). Dickoa provided very helpful advice to an earlier question on this topic here: extracting table from text file
I modified his suggestion to fit my specific problem and posted my code at the link above.
Unfortunately I have encountered a complication. One column in the table contains spaces. These spaces are generating an error when I try to run the code at the link above. Is there a way to modify that code, or specifically the read.table function to recognize the second column below as a column?
Here is a dummy table in a dummy log:
> collect.models(, adjust = FALSE)
model npar AICc DeltaAICc weight Deviance
5 AA(~region + state + county + city)BB(~region + state + county + city)CC(~1) 17 11111.11 0.0000000 5.621299e-01 22222.22
4 AA(~region + state + county)BB(~region + state + county)CC(~1) 14 22222.22 0.0000000 5.621299e-01 77777.77
12 AA(~region + state)BB(~region + state)CC(~1) 13 33333.33 0.0000000 5.621299e-01 44444.44
12 AA(~region)BB(~region)CC(~1) 6 44444.44 0.0000000 5.621299e-01 55555.55
>
> # the three lines below count the number of errors in the code above
Here is the R code I am trying to use. This code works if there are no spaces in the second column, the model column:
my.data <- readLines('c:/users/mmiller21/simple R programs/dummy.log')
top <- '> collect.models\\(, adjust = FALSE)'
bottom <- '> # the three lines below count the number of errors in the code above'
my.data <- my.data[grep(top, my.data):grep(bottom, my.data)]
x <- read.table(text=my.data, comment.char = ">")
I believe I must use the variables top and bottom to locate the table in the log because the log is huge, variable and complex. Also, not every table contains the same number of models.
Perhaps a regex expression could be used somehow taking advantage of the AA and the CC(~1) present in every model name, but I do not know how to begin. Thank you for any help and sorry for the follow-up question. I should have used a more realistic example table in my initial question. I have a large number of logs. Otherwise I could just extract and edit the tables by hand. The table itself is an odd object which I have only ever been able to export directly with capture.output, which would probably still leave me with the same problem as above.
EDIT:
All spaces seem to come right before and right after a plus sign. Perhaps that information can be used here to fill the spaces or remove them.
try inserting my.data <- gsub(" *\\+ *", "+", my.data) before read.table (at that point my.data is still a character vector of lines, so there is no $model column yet):
my.data <- my.data[grep(top, my.data):grep(bottom, my.data)]
my.data <- gsub(" *\\+ *", "+", my.data)
x <- read.table(text = my.data, comment.char = ">")

Write a list of lists to a table, with the names of each list as a column?

I have a fairly basic question about how to write a list to a file.
I have a list generated by Mfuzz acore function, that lists the names of all the probes I have in each of 20 clusters in the following format:
[[1]]
NAME MEM.SHIP
ILMN_X ILMN_X 0.9993195
.
.
.
[[20]]
NAME MEM.SHIP
ILMN_Y ILMN_Y 0.9982345
I want to convert it to a data frame and eventually to an output file where the list number is included as a column;
Like this:
CLUSTER NAME MEM.SHIP
1 ILMN_X 0.9993196
.
.
.
20 ILMN_Y 0.9982345
Where the CLUSTER column indicates which sub-list the probe belongs to. Each probe name can belong to multiple sub-lists.
I have tried different things like suggestions in other posts to use plyr but I always just end up with a single list of all the variables without an indication of which sub-list they belonged to.
Thanks!
If your original list is called clstrs, I believe this is one solution:
do.call(rbind, lapply(seq_along(clstrs), function(i) {
  data.frame(CLUSTER = i, clstrs[[i]])
}))
Here's another way to skin a cat.
# make some sample data
my.df <- data.frame(num = 1:10, val = runif(10))
my.list <- list(my.df, my.df, my.df, my.df, my.df, my.df)
# build index - count the number of rows in each list element that will be
# used to designate the rows based on their previous list affiliation
index <- lapply(my.list, nrow)
index <- rep(1:length(index), times = index)
# from here on it's basically what Nick did. rbind everything together and
# put some lipstick on and voila
my.out <- do.call("rbind", my.list)
my.out$index <- index
#or
my.out <- cbind(my.out, index)
I have a few minutes to spare so I did a quick benchmark using 10e5 rows for each data frame.
My solution with $index:
user system elapsed
0.81 0.27 1.08
Solution with cbind:
user system elapsed
19.92 0.42 20.38
Nick's solution:
user system elapsed
1.04 0.26 1.31
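As an aside for readers coming from the pandas questions earlier in this thread, the same idea (tagging each row with the list element it came from) can be expressed with pd.concat and its keys mechanism. The cluster tables below are hypothetical stand-ins for the clstrs list:

```python
import pandas as pd

# hypothetical stand-ins for two of the 20 cluster tables
clstrs = {
    1: pd.DataFrame({"NAME": ["ILMN_X"], "MEM.SHIP": [0.9993195]}),
    20: pd.DataFrame({"NAME": ["ILMN_Y"], "MEM.SHIP": [0.9982345]}),
}

# the dict keys become an extra index level naming the source cluster,
# which reset_index then turns into a regular CLUSTER column
combined = (pd.concat(clstrs, names=["CLUSTER"])
            .reset_index(level="CLUSTER")
            .reset_index(drop=True))
print(combined)
```

This is the pandas counterpart of the do.call(rbind, lapply(...)) pattern above: the container (a dict here, a list in R) supplies the labels, so no manual index bookkeeping is needed.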