pandas 'outer' merge of multiple csvs using too much memory - python-2.7

I am new to coding and have a lot of big data to deal with. Currently I am trying to merge 26 tsv files (each has two columns without a header: one is a contig_number, the other is a count).
If a tsv has no count for a particular contig_number, it simply does not have that row - so I am attempting to use how='outer' and fill in the missing values with 0 afterwards.
This works on the tsvs I subsetted to run the initial tests, but when I run the script on the actual data, which is large (~40,000 rows, two columns per file), more and more memory is used...
I got to 500 GB of RAM on the server and called it a day.
This is the code that is successful on the subsetted tsvs:
import glob
import logging
from functools import reduce

import pandas as pd

files = glob.glob('*_count.tsv')
data_frames = []
logging.info("Reading in sample files and adding to list")
for fp in files:
    # read in the files and put them into dataframes
    df = pd.read_csv(fp, sep='\t', header=None, index_col=0)
    # rename the columns so we know what file they came from
    df = df.rename(columns={1: str(fp)}).reset_index()
    df = df.rename(columns={0: "contig"})
    # append the dataframes to a list
    data_frames.append(df)

logging.info("Merging the tables on contig, and filling in samples with no counts for contigs")
# merge the tables on contig with how='outer', which keeps all rows but leaves NaN where there is no data
df = reduce(lambda left, right: pd.merge(left, right, how='outer', on="contig"), data_frames)
# this bit is important to fill missing data with a 0
df.fillna(0, inplace=True)

logging.info("Writing concatenated count table to file")
# write the dataframe to file
df.to_csv("combined_bamm_filter_count_file.tsv",
          sep='\t', index=False, header=True)
I would appreciate any advice or suggestions! Maybe there is just too much to hold in memory, and I should be trying something else.
Thank you!

I usually do these types of operations with pd.concat. I don't know the exact details of why it's more efficient, but pandas has some optimizations for combining indices.
I would do
for fp in files:
    # read in the files and put them into dataframes
    df = pd.read_csv(fp, sep='\t', header=None, index_col=0)
    # rename the columns so we know what file they came from
    df = df.rename(columns={1: str(fp)})
    # just keep the contig as the index
    data_frames.append(df)

df_full = pd.concat(data_frames, axis=1)
and then df_full = df_full.fillna(0) if you want to.
In fact, since each of your files has only one column (plus an index), you may do even better by treating them as Series instead of DataFrames.
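As a minimal sketch of that Series-based variant (untested on the real data; the file pattern and output name simply mirror the question):
import glob

import pandas as pd

series_list = []
for fp in glob.glob('*_count.tsv'):
    # column 0 (the contig) becomes the index; column 1 is kept as a count Series
    s = pd.read_csv(fp, sep='\t', header=None, index_col=0)[1]
    s.name = fp  # label the counts with the file they came from
    series_list.append(s)

# concat aligns on the contig index; missing counts become NaN, then 0
df_full = pd.concat(series_list, axis=1).fillna(0)
df_full.index.name = "contig"
df_full.to_csv("combined_bamm_filter_count_file.tsv", sep='\t')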

Related

BadDataError when editing a .dbf file using dbf package

I have recently produced several thousand shapefile outputs and accompanying .dbf files from an atmospheric model (HYSPLIT) on a unix system. The converter txt2dbf is used to convert shapefile attribute tables (text file) to a .dbf.
Unfortunately, something has gone wrong (probably a separator/field length error) because there are 2 problems with the output .dbf files, as follows:
Some fields of the dbf contain data that should not be there. This data has "spilled over" from neighbouring fields.
An additional field has been added that should not be there (it actually comes from a section of the first record of the text file, "1000 201").
This is an example of the first record in the output dbf (retrieved using dbview unix package):
Trajnum : 1001 2
Yyyymmdd : 0111231 2
Time : 300
Level : 0.
1000 201:
Here's what I expected:
Trajnum : 1000
Yyyymmdd : 20111231
Time : 2300
Level : 0.
Separately, I'm looking at how to prevent this from happening again, but ideally I'd like to be able to repair the existing .dbf files. Unfortunately the text files are removed for each model run, so "fixing" the .dbf files is the only option.
My approaches to the above problems are:
Extract the information from the fields that do exist to a new variable using dbf.add_fields and dbf.write (python package dbf), then delete the old incorrect fields using dbf.delete_fields.
Delete the unwanted additional field.
This is what I've tried:
with dbf.Table(db) as db:
    db.add_fields("TRAJNUMc C(4)")  # create new fields
    db.add_fields("YYYYMMDDc C(8)")
    db.add_fields("TIMEc C(4)")
    for record in db:  # extract data from fields
        dbf.write(TRAJNUMc=int(str(record.Trajnum)[:4]))
        dbf.write(YYYYMMDDc=int(str(record.Trajnum)[-1:] + str(record.Yyyymmdd)[:7]))
        dbf.write(TIMEc=record.Yyyymmdd[-1:] + record.Time[:])
    db.delete_fields('Trajnum')  # delete the incorrect fields
    db.delete_fields('Yyyymmdd')
    db.delete_fields('Time')
    db.delete_fields('1000 201')  # delete the unwanted field
    db.pack()
But this produces the following error:
dbf.ver_2.BadDataError: record data is not the correct length (should be 31, not 30)
Given the apparent problem that there has been with the txt2dbf conversion, I'm not surprised to find an error in the record data length. However, does this mean that the file is completely corrupted and that I can't extract the information that I need (frustrating because I can see that it exists)?
EDIT:
Rather than attempting to edit the 'bad' .dbf files, it seems a better approach to 1. extract the required data from the bad files to a text file and then 2. write it to a new dbf. (See Ethan Furman's comments/answer below).
EDIT:
An example of a faulty .dbf file that I need to fix/recover data from can be found here:
https://www.dropbox.com/s/9y92f7m88a8g5y4/p0001120110.dbf?dl=0
An example .txt file from which the faulty dbf files were created can be found here:
https://www.dropbox.com/s/d0f2c0zehsyy8ab/attTEST.txt?dl=0
To fix the data and recreate the original text file, this snippet should help:
import dbf

table = dbf.Table('/path/to/scramble/table.dbf')
with table:
    fixed_data = []
    for record in table:
        # convert to str/bytes while skipping the delete flag
        data = record._data[1:].tostring()
        trajnum = data[:4]
        ymd = data[4:12]
        time = data[12:16]
        level = data[16:].strip()
        # keep the four fields together so each list entry is one record
        fixed_data.append([trajnum, ymd, time, level])

new_file = open('repaired_data.txt', 'w')
for line in fixed_data:
    new_file.write(','.join(line) + '\n')
new_file.close()
Assuming all your data files look like your sample (the big IF being the data has no embedded commas), then this rough code should help translate your text files into dbfs:
raw_data = open('some_text_file.txt').read().split('\n')
final_table = dbf.Table(
    'dest_table.dbf',
    'trajnum C(4); yyyymmdd C(8); time C(4); level C(9)',
)
with final_table:
    for line in raw_data:
        fields = line.split(',')
        final_table.append(tuple(fields))
# table has been populated and closed
Of course, you could get fancier and use actual date and number fields if you want to:
# the dbf field specification becomes
'trajnum N; yyyymmdd D; time C(4); level N'

# the appending-data loop becomes
for line in raw_data:
    trajnum, ymd, time, level = line.split(',')
    trajnum = int(trajnum)
    ymd = dbf.Date(ymd[:4], ymd[4:6], ymd[6:])
    level = int(level)
    final_table.append((trajnum, ymd, time, level))

How to improve the code in a more elegant way with lower memory consumption?

I have a dataset whose dimensions are around 2,000 (rows) x 120,000 (columns).
I'd like to pick out certain columns (~8,000 of them).
So the resulting file would be 2,000 (rows) x 8,000 (columns).
Here is the code, written by someone on Stack Overflow (I am sorry, I have forgotten their name).
import pandas as pd
df = pd.read_csv('...mydata.csv')
my_query = pd.read_csv('...myquery.csv')
df[my_query['Name'].unique()].to_csv('output.csv')
However, this raises a MemoryError in my console, so the code does not work well here.
Does anyone know a more memory-efficient way to select just those columns?
I think I found your source.
So, my solution uses read_csv with these arguments:
iterator=True - if True, return a TextFileReader to enable reading the file into memory piece by piece
chunksize=1000 - the number of rows used to "chunk" the file into pieces; causes a TextFileReader object to be returned
usecols=subset - a subset of columns to return, which results in much faster parsing time and lower memory usage
Source.
I filter the large dataset with usecols - only the (2,000 x 8,000) subset is read instead of the full (2,000 x 120,000).
import pandas as pd

# read the subset from csv and remove duplicate indices
subset = pd.read_csv('8kx1.csv', index_col=[0]).index.unique()
print subset

# use subset as a filter of columns
tp = pd.read_csv('input.csv', iterator=True, chunksize=1000, usecols=subset)
df = pd.concat(tp, ignore_index=True)
print df.head()
print df.shape

# write to csv
df.to_csv('output.csv', chunksize=1000)
I use this snippet for testing:
import pandas as pd
import io
temp=u"""A,B,C,D,E,F,G
1,2,3,4,5,6,7"""
temp1=u"""Name
B
B
C
B
C
C
E
F"""
subset = pd.read_csv(io.StringIO(temp1), index_col=[0]).index.unique()
print subset
#use subset as filter of columns
df = pd.read_csv(io.StringIO(temp), usecols=subset)
print df.head()
print df.shape

How do I iterate a loop over several data frames in a list in python

I am very new to programming and am working with Python. For a work project I am trying to read several .csv files, convert them to data frames, concatenate some of the fields into one for a column header, and then append all of the dataframes into one big DataFrame. I have searched extensively in StackOverflow as well as in other resources but I have not been able to find an answer. Here is the code I have thus far along with some abbreviated output:
import pandas as pd
import glob
# Read a directory of files to a list
csvlist = []
for f in glob.glob("AssayCerts/*"):
    csvlist.append(f)
csvlist
['AssayCerts/CH09051590.csv', 'AssayCerts/CH09051591.csv', 'AssayCerts/CH14158806.csv', 'AssayCerts/CH14162453.csv', 'AssayCerts/CH14186004.csv']
# Read .csv files and convert to DataFrames
dflist = []
for csv in csvlist:
    df = pd.read_csv(filename, header=None, skiprows=7)
    dflist.append(df)
dflist
[ 0 1 2 3 4 5 \
0 NaN Au-AA23 ME-ICP41 ME-ICP41 ME-ICP41 ME-ICP41
1 SAMPLE Au Ag Al As B
2 DESCRIPTION ppm ppm % ppm ppm
# concatenates the cells in the first three rows of the last dataframe; need to apply this to all of the dataframes.
for df in dflist:
    column_names = df.apply(lambda x: str(x[1]) + '-' + str(x[2]) + ' - ' + str(x[0]), axis=0)
column_names
0 SAMPLE-DESCRIPTION - nan
1 Au-ppm - Au-AA23
2 Ag-ppm - ME-ICP41
3 Al-% - ME-ICP41
I am unable to apply the last operation across all of the DataFrames. It seems I can only get it to apply to the last DataFrame in my list. Once I get past this point I will have to append all of the DataFrames to form one large DataFrame.
As Andy Hayden mentions in his comment, the reason your loop only appears to work on the last DataFrame is that you just keep assigning the result of df.apply( ... ) to column_names, which gets written over each time. So at the end of the loop, column_names always contains the results from the last DataFrame in the list.
But you also have some other problems in your code. In the loop that begins for csv in csvlist:, you never actually reference csv - you just reference filename, which doesn't appear to be defined. And dflist just appears to have one DataFrame in it anyway.
As written in your problem, the code doesn't appear to work. I'd advise posting the real code that you're using, and only what's relevant to your problem (i.e. if building csvlist is working for you, then you don't need to show it to us).
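That said, here is a rough sketch (not the poster's actual code; the glob pattern and the step that drops the three label rows are assumptions) of how the loop could be restructured so each CSV is read, labeled, and collected:
import glob

import pandas as pd

dflist = []
for csv in glob.glob("AssayCerts/*"):
    df = pd.read_csv(csv, header=None, skiprows=7)
    # build the column labels from the first three rows of this DataFrame
    df.columns = df.apply(lambda x: str(x[1]) + '-' + str(x[2]) + ' - ' + str(x[0]), axis=0)
    # drop the three label rows now that they are folded into the column names
    df = df.iloc[3:]
    dflist.append(df)

# append all of the DataFrames into one big DataFrame
big_df = pd.concat(dflist, ignore_index=True)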

Unit Test with Pandas Dataframe to read *.csv files

I am often vertically concatenating many *.csv files in Pandas. Every time I do this, I have to check that all the files I am concatenating have the same number of columns. This became quite cumbersome since I had to figure out a way to ignore the files with more or fewer columns than what I tell it I need, e.g. the first 10 files have 4 columns but then file #11 has 8 columns and file #54 has 7 columns. This means I have to load all files - even the files that have the wrong number of columns. I want to avoid loading those files and then trying to concatenate them vertically - I want to skip them completely.
So, I am trying to write a Unit Test with Pandas that will:
a. check the size of all the *.csv files in some folder
b. ONLY read in the files that have a pre-determined number of columns
c. print a message indicating the names of the *.csv files that have the wrong number of columns
Here is what I have (I am working in the folder C:\Users\Downloads):
import unittest
from os import listdir

import numpy as np
import pandas as pd

# Create csv files:
df1 = pd.DataFrame(np.random.rand(10, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.rand(10, 3), columns=['A', 'B', 'C'])
df1.to_csv('test1.csv')
df2.to_csv('test2.csv')

class Conct(unittest.TestCase):
    """Tests for `primes.py`."""
    TEST_INP_DIR = 'C:\Users\Downloads'
    fns = listdir(TEST_INP_DIR)
    t_fn = [fn for fn in fns if fn.endswith(".csv")]
    print t_fn
    dfb = pd.DataFrame()

    def setUp(self):
        for elem in Conct.t_fn:
            print elem
            fle = pd.read_csv(elem)
            try:
                pd.concat([Conct.dfb, fle], axis=0, join='outer',
                          join_axes=None, ignore_index=True, verify_integrity=False)
            except IOError:
                print 'Error: unable to concatenate a file with %s columns.' % fle.shape[1]
                self.err_file = fle

    def tearDown(self):
        del self.err_file

if __name__ == '__main__':
    unittest.main()
Problem:
I am getting this output:
['test1.csv', 'test2.csv']
----------------------------------------------------------------------
Ran 0 tests in 0.000s
OK
The first print statement works - it is printing a list of *.csv files, as expected. But, for some reason, the second and third print statements do not work.
Also, the concatenation should not have gone through - the second file has 3 columns but the first one has 4. The IOError line does not seem to be printing.
How can I use a Python unittest to check each of the *.csv files to make sure that they have the same number of columns before concatenation? And how can I print the appropriate error message at the correct time?
On second thought, instead of chunksize, just read in the first row and count the number of columns, then read and append everything with the correct number of columns. In short:
for f in files:
    test = pd.read_csv(f, nrows=1)
    if len(test.columns) == 4:
        df = df.append(pd.read_csv(f))
Here's the full version:
df1 = pd.DataFrame(np.random.rand(2, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.rand(2, 3), columns=['A', 'B', 'C'])
df3 = pd.DataFrame(np.random.rand(2, 4), columns=['A', 'B', 'C', 'D'])

df1.to_csv('test1.csv', index=False)
df2.to_csv('test2.csv', index=False)
df3.to_csv('test3.csv', index=False)

files = ['test1.csv', 'test2.csv', 'test3.csv']

df = pd.DataFrame()
for f in files:
    test = pd.read_csv(f, nrows=1)
    if len(test.columns) == 4:
        df = df.append(pd.read_csv(f))
In [54]: df
Out [54]:
A B C D
0 0.308734 0.242331 0.318724 0.121974
1 0.707766 0.791090 0.718285 0.209325
0 0.176465 0.299441 0.998842 0.077458
1 0.875115 0.204614 0.951591 0.154492
(Edit to add) Regarding the use of nrows for the test... line: The only point of the test line is to read in enough of the CSV so that on the next line we check if it has the right number of columns before reading in. In this test case, reading in the first row is sufficient to figure out if we have 3 or 4 columns, and it's inefficient to read in more than that, although there is no harm in leaving off the nrows=1 besides reduced efficiency.
In other cases (e.g. no header row and varying numbers of columns in the data), you might need to read in the whole CSV. In that case, you'd be better off doing it like this:
for f in files:
    test = pd.read_csv(f)
    if len(test.columns) == 4:
        df = df.append(test)
The only downside of that way is that you completely read in the datasets with 3 columns that you don't want to keep, but you also don't read in the good datasets twice that way. So that's definitely a better way if you don't want to use nrows at all. Ultimately, it depends on what your actual data looks like as to which way is best for you, of course.
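The original question also asked for this check inside a unittest; as a minimal sketch (not part of the answer above - the glob pattern and the expected column count of 4 are assumptions), the same first-row check could be wrapped in a test case like this:
import glob
import unittest

import pandas as pd

EXPECTED_COLS = 4

class TestCsvColumns(unittest.TestCase):
    def test_csvs_have_expected_column_count(self):
        bad_files = []
        for f in glob.glob("*.csv"):
            header = pd.read_csv(f, nrows=1)  # only peek at the first row
            if len(header.columns) != EXPECTED_COLS:
                bad_files.append(f)
        # fail and name the offending files if any have the wrong width
        self.assertEqual(bad_files, [],
                         "files with the wrong number of columns: %s" % bad_files)

if __name__ == '__main__':
    unittest.main()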

How to rename a column of a data frame with part of the data frame identifier in R?

I've got a number of files that contain gene expression data. In each file, the gene name is kept in a column "Gene_symbol" and the expression measure (a real number) is kept in a column "RPKM". The file name consists of an identifier followed by _ and the rest of the name (ends with "expression.txt"). I would like to load all of these files into R as data frames, for each data frame rename the column "RPKM" with the identifier of the original file and then join the data frames by "Gene_symbol" into one large data frame with one column "Gene_symbol" followed by all the columns with the expression measures from the individual files, each labeled with the original identifier.
I've managed to transfer the identifier of the original files to the names of the individual data frames as follows.
files <- list.files(pattern = "expression.txt$")
for (i in files) {
  var_name = paste("Data", strsplit(i, "_")[[1]][1], sep = "_")
  assign(var_name, read.table(i, header = TRUE)[, c("Gene_symbol", "RPKM")])
}
So now I'm at a stage where I have dataframes as follows:
Data_id0001 <- data.frame(Gene_symbol=c("geneA","geneB","geneC"),RPKM=c(2.43,5.24,6.53))
Data_id0002 <- data.frame(Gene_symbol=c("geneA","geneB","geneC"),RPKM=c(4.53,1.07,2.44))
But then I don't seem to be able to rename the RPKM column with the id000x bit. (That is in a fully automated way of course, looping through all the data frames I will generate in the real scenario.)
I've tried to store the identifier bit as a comment with the data frames but seem to be unable to assign the comment from within a loop.
Any help would be appreciated,
mce
You should never work this way in R. You should always try to keep all your data frames in a list and operate over them using functions such as lapply. Thus, instead of using assign, just create an empty list of the same length as your files list and fill it within the for loop.
For your current situation, we can fix it using the ls and mget combination in order to pull these data frames from the global environment into a list and then change the columns of interest.
temp <- mget(ls(pattern = "Data_id\\d+$"))
lapply(names(temp), function(x) names(temp[[x]])[2] <<- gsub("Data_", "", x))
temp
#$Data_id0001
# Gene_symbol id0001
# 1 geneA 2.43
# 2 geneB 5.24
# 3 geneC 6.53
#
# $Data_id0002
# Gene_symbol id0002
# 1 geneA 4.53
# 2 geneB 1.07
# 3 geneC 2.44
You could eventually use list2env in order to get them back into the global environment, but you should use it with caution.
thanks a lot for your suggestions! I think I get the point. The way I'm doing it now (see below) is hopefully a lot more R-like and works fine!!!
Cheers,
Maik
library(plyr)
files <- list.files(pattern = "expression.txt$")
temp <- list()
for (i in 1:length(files)) {
  temp[[i]] = read.table(files[i], header = TRUE)[, c("Gene_symbol", "RPKM")]
}
for (i in 1:length(temp)) {
  temp[[i]] = rename(temp[[i]], c("RPKM" = strsplit(files[i], "_")[[1]][1]))
}
combined_expression <- join_all(temp, by="Gene_symbol", type="full")