Unit Test with Pandas Dataframe to read *.csv files - unit-testing

I am often vertically concatenating many *.csv files in Pandas. Every time I do this, I have to check that all the files being concatenated have the same number of columns. This became quite cumbersome, since I had to figure out a way to ignore the files with more or fewer columns than expected, e.g. the first 10 files have 4 columns but then file #11 has 8 columns and file #54 has 7 columns. This means I have to load all files, even the files that have the wrong number of columns. I want to avoid loading those files and then trying to concatenate them vertically; I want to skip them completely.
So, I am trying to write a Unit Test with Pandas that will:
a. check the size of all the *.csv files in some folder
b. ONLY read in the files that have a pre-determined number of columns
c. print a message indicating the names of the *.csv files that have the wrong number of columns
Here is what I have (I am working in the folder C:\Users\Downloads):
import unittest
import numpy as np
import pandas as pd
from os import listdir

# Create test csv files: df1 has 4 columns, df2 has only 3
df1 = pd.DataFrame(np.random.rand(10, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.rand(10, 3), columns=['A', 'B', 'C'])
df1.to_csv('test1.csv')
df2.to_csv('test2.csv')

class Conct(unittest.TestCase):
    """Tests for `primes.py`."""
    TEST_INP_DIR = 'C:\Users\Downloads'
    fns = listdir(TEST_INP_DIR)
    t_fn = [fn for fn in fns if fn.endswith(".csv")]
    print t_fn
    dfb = pd.DataFrame()
    def setUp(self):
        for elem in Conct.t_fn:
            print elem
            fle = pd.read_csv(elem)
            try:
                pd.concat([Conct.dfb, fle], axis=0, join='outer', join_axes=None, ignore_index=True, verify_integrity=False)
            except IOError:
                print 'Error: unable to concatenate a file with %s columns.' % fle.shape[1]
                self.err_file = fle
    def tearDown(self):
        del self.err_file

if __name__ == '__main__':
    unittest.main()
Problem:
I am getting this output:
['test1.csv', 'test2.csv']
----------------------------------------------------------------------
Ran 0 tests in 0.000s
OK
The first print statement works: it prints the list of *.csv files, as expected. But, for some reason, the second and third print statements do not.
Also, the concatenation should not have gone through: the second file has 3 columns but the first one has 4. The IOError line does not seem to be printing.
How can I use a Python unittest to check each of the *.csv files to make sure that they have the same number of columns before concatenation? And how can I print the appropriate error message at the correct time?

On second thought, instead of using chunksize, just read in the first row and count the number of columns, then read and append everything with the correct number of columns. In short:
for f in files:
    test = pd.read_csv(f, nrows=1)
    if len(test.columns) == 4:
        df = df.append(pd.read_csv(f))
Here's the full version:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.rand(2,4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.rand(2,3), columns=['A', 'B', 'C'])
df3 = pd.DataFrame(np.random.rand(2,4), columns=['A', 'B', 'C', 'D'])
df1.to_csv('test1.csv', index=False)
df2.to_csv('test2.csv', index=False)
df3.to_csv('test3.csv', index=False)

files = ['test1.csv', 'test2.csv', 'test3.csv']
df = pd.DataFrame()
for f in files:
    test = pd.read_csv(f, nrows=1)
    if len(test.columns) == 4:
        df = df.append(pd.read_csv(f))
In [54]: df
Out[54]:
A B C D
0 0.308734 0.242331 0.318724 0.121974
1 0.707766 0.791090 0.718285 0.209325
0 0.176465 0.299441 0.998842 0.077458
1 0.875115 0.204614 0.951591 0.154492
(Edit to add) Regarding the use of nrows for the test... line: the only point of the test line is to read in just enough of the CSV so that the next line can check whether the file has the right number of columns before reading in the whole thing. In this test case, reading the first row is sufficient to figure out whether we have 3 or 4 columns, and it is inefficient to read in more than that, although there is no harm in leaving off the nrows=1 besides reduced efficiency.
In other cases (e.g. no header row and varying numbers of columns in the data), you might need to read in the whole CSV. In that case, you'd be better off doing it like this:
for f in files:
    test = pd.read_csv(f)
    if len(test.columns) == 4:
        df = df.append(test)
The only downside of that approach is that you completely read in the 3-column datasets that you don't want to keep, but you also don't read the good datasets twice. So it's definitely the better way if you don't want to use nrows at all. Ultimately, which way is best depends on what your actual data looks like, of course.
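If you still want this wrapped in a unittest, as the original question asked, here is a minimal sketch under assumed placeholders (csv files in the current folder, an expected width of 4 columns; adapt both). Note that unittest only runs methods whose names start with test_, which is why the original code reported "Ran 0 tests":
import unittest
from os import listdir

import pandas as pd

EXPECTED_COLS = 4  # assumed expected column count

class TestCsvWidths(unittest.TestCase):
    def test_column_counts(self):
        # Read only the first row of each csv; that is enough to count columns.
        bad = []
        for fn in [f for f in listdir('.') if f.endswith('.csv')]:
            ncols = pd.read_csv(fn, nrows=1).shape[1]
            if ncols != EXPECTED_COLS:
                bad.append((fn, ncols))
        # Fails, and names the offending files, if any file has the wrong width.
        self.assertEqual(bad, [], 'files with wrong column counts: %s' % bad)

if __name__ == '__main__':
    unittest.main()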

Related

use python to write to a specific column in a .csv file

I have a .csv file where I need to overwrite a certain column with new values from a list.
Let's say I have the list L1 = ['La', 'Lb', 'Lc'] that I want to write in column no. 5 of the .csv file.
If I run:
L1 = ['La', 'Lb', 'Lc']
import csv
with open(r'C:\LIST.csv','wb') as f:
    w = csv.writer(f)
    for i in L1:
        w.writerow(i)
This will write the L1 values to the first and second column.
First column will be 'L', 'L', 'L' and second column 'a', 'b', 'c'
I could not find the syntax to write each element from the list to a specific column (this is in Python 2.7). Thank you for your help!
(For this script I must use IronPython, and just the built-in libraries that come with IronPython.)
Although you could certainly use Python's built-in csv module to read the data, modify it, and write it out, I'd recommend the excellent tablib module:
from tablib import Dataset

csv = '''Col1,Col2,Col3,Col4,Col5,Col6,Col7
a1,b1,c1,d1,e1,f1,g1
a2,b2,c2,d2,e2,f2,g2
a3,b3,c3,d3,e3,f3,g3
'''

# Read a hard-coded string just for test purposes.
# In your code, you would use open('...', 'rt').read() to read from a file.
imported_data = Dataset().load(csv, format='csv')

L1 = ['La', 'Lb', 'Lc']
for i in range(len(L1)):
    # Each row is a tuple, and tuples don't support assignment.
    # Convert to a list first so we can modify it.
    row = list(imported_data[i])
    # Put our value in the 5th column (index 4).
    row[4] = L1[i]
    # Store the row back into the Dataset.
    imported_data[i] = row

# Export to CSV. (Of course, you could write this to a file instead.)
print imported_data.export('csv')
# Output:
# Col1,Col2,Col3,Col4,Col5,Col6,Col7
# a1,b1,c1,d1,La,f1,g1
# a2,b2,c2,d2,Lb,f2,g2
# a3,b3,c3,d3,Lc,f3,g3
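Since the question notes that only the built-in libraries that ship with IronPython are available, tablib may not be usable there. Here is a minimal sketch using only the standard csv module (the file paths are placeholders, and it assumes the file has no header row):
import csv

L1 = ['La', 'Lb', 'Lc']

# Read the whole file into memory ('rb'/'wb' because this is Python 2 / IronPython).
with open(r'C:\LIST.csv', 'rb') as f:
    rows = list(csv.reader(f))

# Overwrite the 5th column (index 4) of the first len(L1) rows.
for i, value in enumerate(L1):
    rows[i][4] = value

with open(r'C:\LIST_out.csv', 'wb') as f:
    csv.writer(f).writerows(rows)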

pandas 'outer' merge of multiple csvs using too much memory

I am new to coding and have a lot of big data to deal with. Currently I am trying to merge 26 tsv files (each has two columns without a header: one is a contig_number, the other is a count).
If a tsv did not have a count for a particular contig_number, it does not have that row, so I am attempting to use how='outer' and fill in the missing values with 0 afterwards.
I have been successful with the tsvs which I have subsetted to run the initial tests, but when I run the script on the actual data, which is large (~40,000 rows, two columns), more and more memory is used...
I got to 500GB of RAM on the server and called it a day.
This is the code that is successful on the subsetted tsvs:
import glob
import logging
from functools import reduce

import pandas as pd

files = glob.glob('*_count.tsv')
data_frames = []
logging.info("Reading in sample files and adding to list")
for fp in files:
    # read in the files and put them into dataframes
    df = pd.read_csv(fp, sep='\t', header=None, index_col=0)
    # rename the columns so we know what file they came from
    df = df.rename(columns={1: str(fp)}).reset_index()
    df = df.rename(columns={0: "contig"})
    # append the dataframes to a list
    data_frames.append(df)
logging.info("Merging the tables on contig, and filling in samples with no counts for contigs")
# merge the tables on contig with how='outer', which keeps all rows but leaves gaps where there is no data
df = reduce(lambda left, right: pd.merge(left, right, how='outer', on="contig"), data_frames)
# this bit is important to fill missing data with a 0
df.fillna(0, inplace=True)
logging.info("Writing concatenated count table to file")
# write the dataframe to file
df.to_csv("combined_bamm_filter_count_file.tsv", sep='\t', index=False, header=True)
I would appreciate any advice or suggestions! Maybe there is just too much to hold in memory, and I should be trying something else.
Thank you!
I usually do these types of operations with pd.concat. I don't know the exact details of why it's more efficient, but pandas has some optimizations for combining indices.
I would do
data_frames = []
for fp in files:
    # read in the files and put them into dataframes
    df = pd.read_csv(fp, sep='\t', header=None, index_col=0)
    # rename the columns so we know what file they came from
    df = df.rename(columns={1: str(fp)})
    # just keep the contig as the index
    data_frames.append(df)

df_full = pd.concat(data_frames, axis=1)
and then df_full=df_full.fillna(0) if you want to.
In fact, since each of your files has only one column (plus an index), you may do better yet by treating them as Series instead of DataFrames.
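A minimal sketch of that Series-based variant, under the same assumptions as above (two-column tsvs named *_count.tsv): take the single count column of each file as a Series named after the file, and let pd.concat align them on the contig index:
import glob

import pandas as pd

series_list = []
for fp in glob.glob('*_count.tsv'):
    # column 0 becomes the index (the contig number); column 1 holds the counts
    df = pd.read_csv(fp, sep='\t', header=None, index_col=0)
    # take the single remaining column as a Series named after the file
    series_list.append(df[1].rename(str(fp)))

# align all Series on the contig index; contigs missing from a file become NaN, then 0
df_full = pd.concat(series_list, axis=1).fillna(0)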

Python: Write two columns in csv for many lines

I have two parameters, filename and time, and I want to write them into two columns of a csv file. These two parameters are set in a for-loop, so their values change in each iteration.
My current python code is the one below, but the resulting csv is not what I want:
import csv
import os

with open("txt/scalable_decoding_time.csv", "wb") as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    filename = ["one","two", "three"]
    time = ["1","2", "3"]
    zipped_lists = zip(filename,time)
    for row in zipped_lists:
        print row
        writer.writerow(row)
My csv file must be like below, with , as the delimiter, so I must get two columns:
one, 1
two, 2
three, 3
Right now, my csv file stores all the data in one column.
Do you know how to fix this?
Well, the issue here is that you are using writerows instead of writerow.
import csv
import os

with open("scalable_decoding_time.csv", "wb") as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    level_counter = 0
    max_levels = 3
    filename = ["one","two", "three"]
    time = ["1","2", "3"]
    while level_counter < max_levels:
        writer.writerow((filename[level_counter], time[level_counter]))
        level_counter = level_counter + 1
This gave me the result:
one,1
two,2
three,3
This is another solution
Put the following code into a python script that we will call sc-123.py
filename = ["one","two", "three"]
time = ["1","2", "3"]
for a,b in zip(filename,time):
    print('{}{}{}'.format(a,',',b))
Once the script is ready, run it like that
python2 sc-123.py > scalable_decoding_time.csv
You will have the results formatted the way you want
one,1
two,2
three,3
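One portability aside, flagged as an assumption about your environment: the "wb" open mode above is specific to Python 2's csv module. On Python 3, the equivalent sketch opens the file in text mode with newline='' so the csv writer does not produce blank lines between rows on Windows:
import csv

filename = ["one", "two", "three"]
time = ["1", "2", "3"]

# Python 3: text mode plus newline=''; the csv module manages line endings itself.
with open("scalable_decoding_time.csv", "w", newline='') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    for row in zip(filename, time):
        writer.writerow(row)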

How to improve the code in a more elegant way with lower memory consumption?

I have a dataset whose dimensions are around 2,000 (rows) x 120,000 (columns), and I'd like to pick out certain columns (~8,000 of them).
So the resulting file's dimensions would be 2,000 (rows) x 8,000 (columns).
Here is the code, written by a good man (I found it on stackoverflow, but I am sorry I have forgotten his name):
import pandas as pd

df = pd.read_csv('...mydata.csv')
my_query = pd.read_csv('...myquery.csv')
df[my_query['Name'].unique()].to_csv('output.csv')
However, this code raises a MemoryError in my console, so it does not work on data of this size.
So does anyone know how to improve the code so that it selects the desired columns more efficiently?
I think I found your source.
So, my solution uses read_csv with these arguments:
iterator=True - if True, return a TextFileReader so the file can be read into memory piece by piece
chunksize=1000 - the number of rows used to "chunk" the file into pieces; causes a TextFileReader object to be returned
usecols=subset - a subset of columns to return, which gives much faster parsing and lower memory usage
Source.
I filter the large dataset with usecols, so only a (2,000 x 8,000) dataset is read instead of the full (2,000 x 120,000).
import pandas as pd

# read subset from csv and remove duplicate indices
subset = pd.read_csv('8kx1.csv', index_col=[0]).index.unique()
print subset

# use subset as filter of columns
tp = pd.read_csv('input.csv', iterator=True, chunksize=1000, usecols=subset)
df = pd.concat(tp, ignore_index=True)
print df.head()
print df.shape

# write to csv (to_csv has no iterator argument; chunksize alone controls chunked writing)
df.to_csv('output.csv', chunksize=1000)
I use this snippet for testing:
import pandas as pd
import io
temp=u"""A,B,C,D,E,F,G
1,2,3,4,5,6,7"""
temp1=u"""Name
B
B
C
B
C
C
E
F"""
subset = pd.read_csv(io.StringIO(temp1), index_col=[0]).index.unique()
print subset
#use subset as filter of columns
df = pd.read_csv(io.StringIO(temp), usecols=subset)
print df.head()
print df.shape
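As a further defensive option (assuming a reasonably recent pandas), usecols also accepts a callable: only columns for which it returns True are kept, which avoids an error if some requested column names are missing from a particular input file:
import pandas as pd

wanted = set(subset)  # the unique names read from the query file above

# keep a column only if its name is in the wanted set; unknown names are simply skipped
df = pd.read_csv('input.csv', usecols=lambda name: name in wanted)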

Print columns of Pandas dataframe to separate files + dataframe with datetime (min/sec)

I am trying to print a Pandas dataframe's columns to separate *.csv files in Python 2.7.
Using this code, I get a dataframe with 4 columns and an index of dates:
import datetime as dt
import numpy as np
import pandas as pd

rows = 10  # rows is not defined in the question; 10 matches the randn(10, 4) call below
col_headers = list('ABCD')
dates = pd.date_range(dt.datetime.today().strftime("%m/%d/%Y"), periods=rows)
df2 = pd.DataFrame(np.random.randn(10, 4), index=dates, columns=col_headers)
df = df2.tz_localize('UTC') # this does not seem to be giving me hours/minutes/seconds
I then remove the index and set it to a separate column:
df['Date'] = df.index
col_headers.append('Date') #update the column keys
At this point, I just need to print all 5 columns of the dataframe to separate files. Here is what I have tried:
for ijk in range(0,len(col_headers)):
    df.to_csv('output' + str(ijk) + '.csv', columns = col_headers[ijk])
I get the following error message:
KeyError: "[['D', 'a', 't', 'e']] are not in ALL in the [columns]"
If I say:
for ijk in range(0,len(col_headers)-1):
then it works, but it does not print the 'Date' column. That is not what I want. I need to also print the date column.
Questions:
How do I get it to print the 'Date' column to a *.csv file?
How do I get the time with hours, minutes and seconds? If the number of rows is changed from 10 to 5000, will the seconds change from one row of the dataframe to the next?
EDIT:
- Answer for Q2 (See here) ==> in the case of my particular code, see this:
dates = pd.date_range(dt.datetime.today().strftime("%m/%d/%Y %H:%M"),periods=rows)
I don't quite understand your logic, but the following is a simpler method to do it:
for col in df:
    df[col].to_csv('output' + col + '.csv')
example:
In [41]:
for col in df2:
    print('output' + col + '.csv')
outputA.csv
outputB.csv
outputC.csv
outputD.csv
outputDate.csv
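As an aside on the original KeyError: the columns parameter of to_csv expects a list-like, so the bare string 'Date' gets iterated character by character, which is exactly why the error message lists ['D', 'a', 't', 'e']. A minimal fix for the asker's own loop, keeping their variable names, is to wrap each name in a list:
for ijk in range(0, len(col_headers)):
    # wrap the single column name in a list; a bare string is iterated per character
    df.to_csv('output' + str(ijk) + '.csv', columns=[col_headers[ijk]])
And regarding the second question (timestamps that actually differ within a day): tz_localize only attaches a timezone and never adds time-of-day resolution. A minimal sketch, assuming a per-second spacing is what is wanted, passes freq='S' to date_range so consecutive rows are one second apart:
import datetime as dt

import pandas as pd

rows = 5  # small example; any row count works
# one timestamp per second, starting from the current time
dates = pd.date_range(dt.datetime.today().strftime("%m/%d/%Y %H:%M:%S"), periods=rows, freq='S')
print dates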