I wrote a very simple program that was supposed to read a CSV and print all of the rows twice. However, when I ran the program, it printed all of the rows the first time, and nothing the second time.
Code:
import csv
csvfile = csv.reader(open(<path>, 'rb'))
print 'Attempt 1'
for row in csvfile:
    print row
print 'Attempt 2'
for row in csvfile:
    print row
Output:
Attempt 1
['a', 'b', 'c']
['d', 'e', 'f']
Attempt 2
Why is the code not printing the contents again the second time?
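You need to rewind the open file. The csv.reader object itself has no seek() method, so keep a reference to the underlying file object and rewind that:
import csv
f = open(<path>, 'rb')
csvfile = csv.reader(f)
print 'Attempt 1'
for row in csvfile:
    print row
# Rewind the file, not the reader.
f.seek(0)
print 'Attempt 2'
for row in csvfile:
    print row
f.close()
This way it should work fine.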
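Correct me if I'm wrong, but I'm pretty sure the csvfile variable you create is an iterator, which behaves much like a generator object.
An iterator does not hold all the rows in memory. It produces them one at a time, so it can only be iterated over once. After the first loop it is exhausted, and the second loop has nothing left to read.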
Hope this helps,
Luke
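A minimal sketch of the behaviour described above, written for Python 3 and using an in-memory io.StringIO in place of the real file (that substitution is an assumption made here so the example is self-contained). It shows the reader being exhausted after one pass, and two ways around it: copy the rows into a list, or rewind the underlying file object.

```python
import csv
import io

# Hypothetical in-memory data standing in for the file at <path>.
sample = io.StringIO("a,b,c\nd,e,f\n")
reader = csv.reader(sample)

# Option 1: copy the rows into a list, which can be looped over many times.
first_pass = list(reader)

# The reader is now exhausted, so a second pass yields nothing.
second_pass = list(reader)

# Option 2: rewind the underlying file object, then iterate again.
sample.seek(0)
third_pass = list(reader)

print(first_pass)
print(second_pass)
print(third_pass)
```

Copying into a list is the simpler choice when the file fits in memory; rewinding avoids holding all rows at once.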
I have a .csv file where I need to overwrite a certain column with new values from a list.
Let's say I have the list L1 = ['La', 'Lb', 'Lc'] that I want to write in column no. 5 of the .csv file.
If I run:
L1 = ['La', 'Lb', 'Lc']
import csv
with open(r'C:\LIST.csv','wb') as f:
    w = csv.writer(f)
    for i in L1:
        w.writerow(i)
This will write the L1 values to the first and second column.
First column will be 'L', 'L', 'L' and second column 'a', 'b', 'c'
I could not find the syntax to write each element of the list to a specific column (this is in Python 2.7). Thank you for your help!
(For this script I must use IronPython, and only the built-in libraries that come with IronPython.)
Although you could certainly use Python's built-in csv module to read the data, modify it, and write it out, I'd recommend the excellent tablib module:
from tablib import Dataset
csv = '''Col1,Col2,Col3,Col4,Col5,Col6,Col7
a1,b1,c1,d1,e1,f1,g1
a2,b2,c2,d2,e2,f2,g2
a3,b3,c3,d3,e3,f3,g3
'''
# Read a hard-coded string just for test purposes.
# In your code, you would use open('...', 'rt').read() to read from a file.
imported_data = Dataset().load(csv, format='csv')
L1 = ['La', 'Lb', 'Lc']
for i in range(len(L1)):
    # Each row is a tuple, and tuples don't support assignment.
    # Convert to a list first so we can modify it.
    row = list(imported_data[i])
    # Put our value in the 5th column (index 4).
    row[4] = L1[i]
    # Store the row back into the Dataset.
    imported_data[i] = row
# Export to CSV. (Of course, you could write this to a file instead.)
print imported_data.export('csv')
# Output:
# Col1,Col2,Col3,Col4,Col5,Col6,Col7
# a1,b1,c1,d1,La,f1,g1
# a2,b2,c2,d2,Lb,f2,g2
# a3,b3,c3,d3,Lc,f3,g3
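Since the question asks for built-in libraries only, here is a rough sketch of the same idea using just the csv module. It is written for Python 3 and reads from an in-memory string; the sample data and the presence of a header row are assumptions carried over from the tablib example above. Note that writerow expects a sequence of fields, which is why passing a bare string such as 'La' gets split into the characters 'L' and 'a'.

```python
import csv
import io

# Hypothetical sample data; in real code this would come from open(...).
source = io.StringIO(
    "Col1,Col2,Col3,Col4,Col5,Col6,Col7\n"
    "a1,b1,c1,d1,e1,f1,g1\n"
    "a2,b2,c2,d2,e2,f2,g2\n"
    "a3,b3,c3,d3,e3,f3,g3\n"
)
L1 = ['La', 'Lb', 'Lc']

# Read everything first, because the same file cannot be safely
# read and overwritten at the same time.
rows = list(csv.reader(source))
header, body = rows[0], rows[1:]

# Overwrite the 5th column (index 4) of each data row.
for row, value in zip(body, L1):
    row[4] = value

# Write the modified rows back out; each row is a list of fields.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(header)
writer.writerows(body)
result = out.getvalue()
print(result)
```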
I have a CSV file that contains review data, and I want to append it to a list.
Here is a sample in my file.csv:
I love eating them and they are good for watching TV and looking at movies
This taffy is so good. It is very soft and chewy
I want to save all the words of the second line in a list and print them:
['This', 'taffy', 'is', 'so', 'good.', 'It', 'is', 'very', 'soft', 'and', 'chewy']
I tried this:
import csv
with open('file.csv', 'r') as csvfile:
    data = csv.reader(csvfile, delimiter=',')
    texts = []
    next(data)
    for row in data:
        texts.append(row[2])
    print(texts)
My problem is that it doesn't print anything. Can anyone help here? Thanks in advance.
Don't forget to import csv. If you want to save all the words in the second line, enumerate the lines, pick the one you want, split it into words, and save them in the list, like this:
import csv
texts = []
with open('file.csv', 'r') as csvfile:
    for i, line in enumerate(csvfile):
        # Index 1 is the second line of the file.
        if i == 1:
            for word in line.split():
                texts.append(word)
print(texts)
['This', 'taffy', 'is', 'so', 'good.', 'It', 'is', 'very', 'soft', 'and', 'chewy']
I want to replace one specific word, 'my', with 'your'. But it seems my code only changes the first occurrence.
import csv
path1 = "/home/bankdata/levelout.csv"
path2 = "/home/bankdata/leveloutmodify.csv"
in_file = open(path1,"rb")
reader = csv.reader(in_file)
out_file = open(path2,"wb")
writer = csv.writer(out_file)
with open(path1, 'r') as csv_file:
    csvreader = csv.reader(csv_file)
    col_count = 0
    for row in csvreader:
        while row[col_count] == 'my':
            print 'my is used'
            row[col_count] = 'your'
            #writer.writerow(row[col_count])
        writer.writerow(row)
        col_count +=1
Let's say the sentence is
'my book is gone and my bag is missing'
The output is
your book is gone and my bag is missing
The second thing is that I want the output to appear without comma separators:
print row
The output is
your,book,is,gone,and,my,bag,is,missing,
For the second problem, I'm still trying to find the correct approach, as it keeps giving me the same comma-separated output.
with open(path1) as infile, open(path2, "w") as outfile:
    for row in infile:
        outfile.write(row.replace(",", ""))
        print row
It gives me the result:
your,book,is,gone,and,my,bag,is,missing
I send this sentence to my Nao robot, and the robot pronounces it awkwardly because there are commas between the words.
I solved it with:
with open(path1) as infile, open(path2, "w") as outfile:
    for row in infile:
        outfile.write(row.replace(",", ""))
with open(path2) as out:
    for row in out:
        print row
It gives me what I want:
your book is gone and your bag is missing too
However, is there a better way to do it?
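One possible approach, sketched here for Python 3 with an in-memory string standing in for levelout.csv (the sample row is an assumption based on the sentence quoted above): check every field in the row rather than a single column index, then join the fields with spaces so no commas reach the robot.

```python
import csv
import io

# Hypothetical sample standing in for the contents of levelout.csv.
source = io.StringIO("my,book,is,gone,and,my,bag,is,missing\n")

sentences = []
for row in csv.reader(source):
    # Replace every field that is exactly 'my', not just the first one.
    replaced = ['your' if word == 'my' else word for word in row]
    # Join with spaces instead of letting csv.writer insert commas.
    sentences.append(' '.join(replaced))

for sentence in sentences:
    print(sentence)
```

Comparing whole fields avoids touching words that merely contain 'my', which a plain string replace on the line would also change.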
I am often vertically concatenating many *.csv files in Pandas. Every time I do this, I have to check that all the files I am concatenating have the same number of columns. This became quite cumbersome, since I had to figure out a way to ignore the files with more or fewer columns than I need. E.g. the first 10 files have 4 columns, but then file #11 has 8 columns and file #54 has 7 columns. This means I have to load all files, even the files that have the wrong number of columns. I want to avoid loading those files and then trying to concatenate them vertically; I want to skip them completely.
So, I am trying to write a Unit Test with Pandas that will:
a. check the size of all the *.csv files in some folder
b. ONLY read in the files that have a pre-determined number of columns
c. print a message indicating the names of the *.csv files that have the wrong number of columns
Here is what I have (I am working in the folder C:\Users\Downloads):
import unittest
import numpy as np
import pandas as pd
from os import listdir
# Create csv files:
df1 = pd.DataFrame(np.random.rand(10,4), columns = ['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.rand(10,3), columns = ['A', 'B', 'C'])
df1.to_csv('test1.csv')
df2.to_csv('test2.csv')
class Conct(unittest.TestCase):
    """Tests for `primes.py`."""
    TEST_INP_DIR = 'C:\Users\Downloads'
    fns = listdir(TEST_INP_DIR)
    t_fn = [fn for fn in fns if fn.endswith(".csv")]
    print t_fn
    dfb = pd.DataFrame()
    def setUp(self):
        for elem in Conct.t_fn:
            print elem
            fle = pd.read_csv(elem)
            try:
                pd.concat([Conct.dfb,fle],axis = 0, join='outer', join_axes=None, ignore_index=True, verify_integrity=False)
            except IOError:
                print 'Error: unable to concatenate a file with %s columns.' % fle.shape[1]
                self.err_file = fle
    def tearDown(self):
        del self.err_file
if __name__ == '__main__':
    unittest.main()
Problem:
I am getting this output:
['test1.csv', 'test2.csv']
----------------------------------------------------------------------
Ran 0 tests in 0.000s
OK
The first print statement works: it prints a list of *.csv files, as expected. But, for some reason, the second and third print statements do not work.
Also, the concatenation should not have gone through, since the second file has 3 columns but the first one has 4 columns. The message in the IOError branch does not seem to be printed.
How can I use a Python unittest to check each of the *.csv files to make sure that they have the same number of columns before concatenation? And how can I print the appropriate error message at the correct time?
On second thought, instead of chunksize, just read in the first row and count the number of columns, then read and append everything with the correct number of columns. In short:
for f in files:
    test = pd.read_csv( f, nrows=1 )
    if len( test.columns ) == 4:
        df = df.append( pd.read_csv( f ) )
Here's the full version:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.rand(2,4), columns = ['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.rand(2,3), columns = ['A', 'B', 'C'])
df3 = pd.DataFrame(np.random.rand(2,4), columns = ['A', 'B', 'C', 'D'])
df1.to_csv('test1.csv',index=False)
df2.to_csv('test2.csv',index=False)
df3.to_csv('test3.csv',index=False)
files = ['test1.csv', 'test2.csv', 'test3.csv']
df = pd.DataFrame()
for f in files:
    # Read only the first row to check the number of columns.
    test = pd.read_csv( f, nrows=1 )
    if len( test.columns ) == 4:
        df = df.append( pd.read_csv( f ) )
In [54]: df
Out [54]:
          A         B         C         D
0  0.308734  0.242331  0.318724  0.121974
1  0.707766  0.791090  0.718285  0.209325
0  0.176465  0.299441  0.998842  0.077458
1  0.875115  0.204614  0.951591  0.154492
(Edit to add) Regarding the use of nrows for the test... line: The only point of the test line is to read in enough of the CSV so that on the next line we check if it has the right number of columns before reading in. In this test case, reading in the first row is sufficient to figure out if we have 3 or 4 columns, and it's inefficient to read in more than that, although there is no harm in leaving off the nrows=1 besides reduced efficiency.
In other cases (e.g. no header row and varying numbers of columns in the data), you might need to read in the whole CSV. In that case, you'd be better off doing it like this:
for f in files:
    test = pd.read_csv( f )
    if len( test.columns ) == 4:
        df = df.append( test )
The only downside of that approach is that you completely read in the datasets with 3 columns that you don't want to keep, but you also avoid reading in the good datasets twice. So it is the better choice if you don't want to use nrows at all. Ultimately, which way is best depends on what your actual data looks like.
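As a rough sketch of the "check the first row, then decide" idea that also reports the skipped file names, the column check can be done with the standard csv module before pandas touches the file. The helper name split_by_column_count, the temporary directory, and the sample contents are all hypothetical, and the code is written for Python 3.

```python
import csv
import os
import tempfile


def split_by_column_count(paths, expected):
    """Return (matching, skipped) path lists based on the first row's width."""
    matching, skipped = [], []
    for path in paths:
        with open(path, newline="") as handle:
            # Only the first row is read; an empty file counts as 0 columns.
            first_row = next(csv.reader(handle), [])
        if len(first_row) == expected:
            matching.append(path)
        else:
            skipped.append(path)
    return matching, skipped


# Hypothetical sample files mirroring test1.csv, test2.csv and test3.csv.
workdir = tempfile.mkdtemp()
samples = {
    "test1.csv": "A,B,C,D\n1,2,3,4\n",
    "test2.csv": "A,B,C\n1,2,3\n",
    "test3.csv": "A,B,C,D\n5,6,7,8\n",
}
paths = []
for name, text in samples.items():
    path = os.path.join(workdir, name)
    with open(path, "w", newline="") as handle:
        handle.write(text)
    paths.append(path)

matching, skipped = split_by_column_count(paths, expected=4)
for path in skipped:
    print("Skipping %s: wrong number of columns" % os.path.basename(path))
```

The matching list could then be loaded with pd.read_csv and combined in a single pd.concat call. On pandas 2.0 and later that is also the necessary route, because DataFrame.append is no longer available there.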
My INPUT file:
1,boss,30
2,go,35
2,nan,45
3,fog,33
4,kd,55
4,gh,56
Output file should be:
1,boss,30
3,fog,33
That is, my output file should be free of duplicates: any record whose column 1 value is repeated should be deleted entirely.
Code I tried:
source_rd = csv.writer(open("Non_duplicate_source.csv", "wb"),delimiter=d)
gok = set()
for rowdups in sort_src:
    if rowdups[0] not in gok:
        source_rd.writerow(rowdups)
        gok.add( rowdups[0])
Output I got:
1,boss,30
2,go,35
3,fog,33
4,kd,55
What am I doing wrong?
You can just loop the file twice.
The first time through, count all the duplicates. Second time through fetch the ones of interest.
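import csv
gok={}
# First pass: count how often each key in column 1 occurs.
with open(fn) as fin:
    reader=csv.reader(fin)
    for e in reader:
        gok[e[0]]=gok.setdefault(e[0], 0)+1
# Second pass: keep only the rows whose key occurs exactly once.
with open(fn) as fin:
    reader=csv.reader(fin)
    for e in reader:
        if gok[e[0]]==1:
            print e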
Prints:
['1', 'boss', '30']
['3', 'fog', '33']
The reason your method does not work is that once the second instance of the duplicate is seen, the first has already been written.
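If the file fits in memory, the same count-then-filter idea can be sketched with a single read and collections.Counter. This is a Python 3 sketch, and the in-memory string is an assumption standing in for the input file shown in the question.

```python
import csv
import io
from collections import Counter

# Hypothetical in-memory copy of the input file from the question.
source = io.StringIO(
    "1,boss,30\n"
    "2,go,35\n"
    "2,nan,45\n"
    "3,fog,33\n"
    "4,kd,55\n"
    "4,gh,56\n"
)

# Read all rows once, then count how often each first-column key occurs.
rows = list(csv.reader(source))
counts = Counter(row[0] for row in rows)

# Keep only rows whose key appears exactly once.
unique_rows = [row for row in rows if counts[row[0]] == 1]

for row in unique_rows:
    print(row)
```

Counting first is what makes this work: a key can only be called unique after every row has been seen.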