I have to read a large CSV file of almost 100K rows, and it would be much easier to process the file if I could read each row as a dictionary.
After a little research I found csv.DictReader in Python's built-in csv module.
But the documentation does not clearly state whether it stores the whole file in memory or not.
It does mention that:
The fieldnames parameter is a sequence whose elements are associated with the fields of the input data in order.
But I'm not sure whether that sequence is stored in memory or not.
So the question is, does it store the whole file in memory?
If so, is there any other option to read a single row at a time from the file, as with a generator expression, and get each row as a dict?
Here is my code:
def file_to_dictionary(self, file_path):
    """Read CSV rows as a dictionary """
    file_data_obj = {}
    try:
        self.log("Reading file: [{}]".format(file_path))
        if os.path.exists(file_path):
            file_data_obj = csv.DictReader(open(file_path, 'rU'))
        else:
            self.log("File does not exist: {}".format(file_path))
    except Exception as e:
        self.log("Failed to read file.", e, True)
    return file_data_obj
As far as I'm aware, the DictReader object you create, in your case file_data_obj, is an iterator that behaves much like a generator: it reads rows from the underlying file one at a time as you loop over it.
The rows are therefore not all held in memory at once, but the object can only be iterated over once!
To print the fieldnames of your data as a list you can simply use: print file_data_obj.fieldnames
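A minimal sketch of that row-at-a-time reading, written for Python 3 (the question's code is Python 2). The helper name iter_csv_rows is an assumption made for illustration, not part of the csv module:

```python
import csv


def iter_csv_rows(path):
    """Yield each CSV row as a dict, one row at a time (hypothetical helper)."""
    # newline="" is the mode the Python 3 csv documentation recommends
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            yield row
```

Looping with "for row in iter_csv_rows(path):" then holds only the current row in memory, and the file is closed once the loop finishes.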
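Secondly, in my experience I find it much easier to use a list of dictionaries when reading data from CSV files, where each dictionary represents a row in your file. Note that building the list does load every row into memory, which is usually manageable for a file of around 100K rows. Consider the following: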
def csv_to_dict_list(path):
    # 'rb' is the mode the Python 2 csv module expects
    with open(path, 'rb') as csv_in:
        reader = csv.DictReader(csv_in, restkey=None, restval=None, dialect='excel')
        fields = reader.fieldnames
        list_out = [row for row in reader]
    return list_out, fields
Using the function above (or something similar), you can achieve your goal with a couple of lines. E.g.:
data, data_fields = csv_to_dict_list(path)
print data_fields  # prints fieldnames
print data[0]  # prints first row of data from file
Hope this helps!
Luke
I have gone through similar questions but am having trouble fitting this to my needs. I am reading a CSV, creating a list and appending the list to a separate CSV.
with open('in_table.csv', 'rb') as vo:
    next(vo) # skip header row
    reader = csv.reader(vo)
    vo_list = list(reader)
    print vo_list
with open('out_table.csv', 'ab') as f:
    cf = csv.writer(f)
    for row in vo_list:
        cf.writerow(row)
I need to write the list starting at the second column and not the first, as the first column will contain separate information. What is the simplest way to do this?
Realistically I have another input CSV exactly like the first one and I need to put them both into the output file into a total of 4 columns. Like so:
Column1, join_count1, grid_id1, join_count2, grid_id2
Blah, 0, U24, 3, U24
I would go with the built-in csv package. Also, you are opening the CSV files as binary files; was that intentional? (Python 2's csv module does expect binary mode, while Python 3 expects text mode.) CSVs should be text files by definition, but if yours are binary then please correct the flags below:
import csv
# newline="" is what the Python 3 csv docs recommend for files handed to csv.reader/csv.writer
with open("out_table.csv", "a+", newline="") as out_file:
    writer = csv.writer(out_file)
    with open("in_table.csv", newline="") as in_file:
        reader = csv.reader(in_file)
        next(reader) # skip the header
        for oid, join_count, grid_id in reader:
            writer.writerow([join_count, grid_id])
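For the second part of the question (two input files placed side by side, starting at the second column), one possible sketch follows. It assumes both inputs have a header row and the same number of rows in the same order, and it leaves the first output column empty as a placeholder; merge_columns and the in-memory sample data are made up for illustration:

```python
import csv
import io


def merge_columns(in_a, in_b, out_file):
    """Write the non-id columns of two CSV inputs side by side (hypothetical helper)."""
    reader_a = csv.reader(in_a)
    reader_b = csv.reader(in_b)
    next(reader_a)  # skip the header of the first input
    next(reader_b)  # skip the header of the second input
    writer = csv.writer(out_file)
    for row_a, row_b in zip(reader_a, reader_b):
        # "" keeps the first column free for the separate information
        writer.writerow([""] + row_a[1:] + row_b[1:])


# in-memory stand-ins for the two input files and the output file
first = io.StringIO("oid,join_count,grid_id\n1,0,U24\n")
second = io.StringIO("oid,join_count,grid_id\n1,3,U24\n")
out = io.StringIO()
merge_columns(first, second, out)
```

With real files, the three arguments would be file objects opened in text mode with newline="".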
I am writing a CSV parser which has the following structure:
class decode:
    def __init__(self):
        self.fd = open('test.csv')
    def decodeoperation(self):
        for row in self.fd:
            cmd = self.decodecmd(row)
            if cmd == 'A':
                self.decodeAopt()
            elif cmd == 'B':
                self.decodeBopt()
    def decodeAopt(self):
        for row in self.fd:
            #decodefurther dependencies based on cmd A till
            #a condition occurs on any further row
            return
    def decodeBopt(self):
        for row in self.fd:
            #decodefurther dependencies based on cmd B till
            #a condition occurs on any further row
            return
The current code is working fine for me, but I don't feel good about iterating through the CSV file in every method. Could it be done in a better way?
There is nothing inherently wrong with using a common iterator across multiple methods, as long as you can determine in advance which method to dispatch to at any given point in the sequence (which you are doing by decoding the cmd from the row and getting 'A', 'B', etc.). The design has issues if you have to read several items before you could determine which method to call, and might have to back up if you picked the wrong method and needed to try another. In parsing, this is called backtracking. Since you are passing around a file object, backing up is difficult. Note that your separate decoder methods will have to know when to stop before reading the next row that contains a command, so they will need some sort of terminating sentinel row that they can recognize.
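A small sketch of the shared-iterator idea with a terminating sentinel row. The sentinel value "END", the function names, and the sample rows are all assumptions made for illustration:

```python
def read_block(rows):
    """Collect the rows of one command until the (hypothetical) sentinel row "END"."""
    block = []
    for row in rows:
        if row == "END":
            break
        block.append(row)
    return block


def decode_all(lines):
    """Dispatch on command rows; each block reader advances the shared iterator."""
    rows = iter(lines)  # one iterator shared by the outer loop and the block readers
    decoded = []
    for row in rows:
        if row in ("A", "B"):
            decoded.append((row, read_block(rows)))
    return decoded


result = decode_all(["A", "1", "2", "END", "B", "3", "END"])
```

Because read_block consumes the sentinel itself, the outer loop resumes at the next command row and no backing up is needed.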
Some general comments on your Python and class design:
You have a nice simple if-elif-elif dispatch table that can translate to a Python dict like this:
# put this code in place of your "if cmd == ... elif elif elif..." code
dispatch = {
    # note - no ()'s, we just want to reference the methods, not call them
    'A': self.decodeAopt,
    'B': self.decodeBopt,
    'C': self.decodeCopt,
    # look how easy it is to add more decoders
}
# lookup which decoder to use for the current cmd
decoder = dispatch[cmd]
# run it
decoder()
# or do it all in one line
dispatch[cmd]()
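One thing a plain dispatch[cmd] lookup does not handle is a command with no entry, which raises KeyError. A small sketch of one way to deal with that, using dict.get with a fallback; the handler names here are placeholders, not the methods of the original class:

```python
def decode_a():
    return "decoded A"


def decode_b():
    return "decoded B"


def decode_unknown():
    return "unknown command"


# map each command to its handler; functions are referenced, not called
DISPATCH = {"A": decode_a, "B": decode_b}


def run(cmd):
    """Look up the handler for cmd, falling back to decode_unknown."""
    handler = DISPATCH.get(cmd, decode_unknown)
    return handler()
```

Whether an unknown command should be ignored, logged, or treated as an error depends on the file format, so the fallback here is only one option.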
Instead of having your __init__ method open a file, let it accept an iterator object. This will make it much easier to write tests for your object, since you'll be able to pass simple Python lists containing CSV rows.
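class decode:
    def __init__(self, sequence):
        # iter() lets every method continue from the same position, even when a plain list is passed in
        self.fd = iter(sequence)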
You might want to rename this var from 'fd' to something like 'seq', since it doesn't have to be a file, but could be any iterable that gives you decodable rows.
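One detail worth checking when passing a list in tests: a list restarts from the beginning each time a for loop runs over it, so methods that are meant to continue where the previous one stopped need a shared iterator. A sketch, with hypothetical class and method names:

```python
class RowSource:
    """Wrap any iterable of rows so that all methods share one read position."""

    def __init__(self, sequence):
        # iter() works for lists, open files and csv readers alike
        self.rows = iter(sequence)

    def take_first(self):
        return next(self.rows)

    def take_rest(self):
        # continues after whatever take_first already consumed
        return list(self.rows)


source = RowSource(["A,1", "x,2", "y,3"])
first = source.take_first()
rest = source.take_rest()
```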
If you are doing your own CSV parsing, look at using the builtin csv module. It will do quite a bit of work for you, like parsing quoted strings that could contain commas, and can give you easy-to-work-with dicts for each row, given headers read from the input file, or specified by you. If you have modified __init__ as I suggested, you can use it like:
import csv
# assuming test.csv has a header row
reader = csv.DictReader(open('test.csv'))
# or specify headers if not - I encourage you to give these columns better names
reader.fieldnames = ['cmd', 'val1', 'val2', 'val3']
decoder = decode(reader)
decoder.decodeoperation()
Then you can write in decodeoperation:
cmd = row['cmd']
Note that this would impart a slightly different design to your class, that it would expect to be given a sequence of dicts, rather than a sequence of strings.
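To see what such a sequence of dicts looks like, here is a short Python 3 sketch using in-memory sample data (made up for illustration) in place of test.csv, with the fieldnames passed directly to DictReader:

```python
import csv
import io

# stand-in for a header-less test.csv
sample = io.StringIO("A,1,2,3\nB,4,5,6\n")
reader = csv.DictReader(sample, fieldnames=["cmd", "val1", "val2", "val3"])

# each row is a dict keyed by the given fieldnames; values stay strings
rows = list(reader)
commands = [row["cmd"] for row in rows]
```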
I am trying to build a tool that can convert .csv files into .yaml files for further use. I found a handy bit of code that does the job nicely from the link below:
Convert CSV to YAML, with Unicode?
which states that the line will take the dict created by opening a .csv file and dump it to a .yaml file:
out_file.write(ry.safe_dump(dict_example,allow_unicode=True))
However, one small kink I have noticed is that when it is run once, the generated .yaml file is typically incomplete by a line or two. In order to have the .csv file exhaustively read through to create a complete .yaml file, the code must be run two or even three times. Does anybody know why this could be?
UPDATE
Per request, here is the code I use to parse my .csv file, which is two columns long (with a string in the first column and a list of two strings in the second column), and will typically be 50 rows long (or maybe more). Also note that it is designed to remove any '\n' or spaces that could potentially cause problems later on in the code.
csv_contents = {}
with open("example1.csv", "rU") as csvfile:
    green = csv.reader(csvfile, dialect='excel')
    for line in green:
        candidate_number = line[0]
        first_sequence = line[1].replace(' ','').replace('\r','').replace('\n','')
        second_sequence = line[2].replace(' ','').replace('\r','').replace('\n','')
        csv_contents[candidate_number] = [first_sequence, second_sequence]
csv_contents.pop('Header name', None)
Ultimately, it is not that important that I maintain the order of the rows from the original dict, just that all the information within the rows is properly structured.
I am not sure what the cause could be, but you might be running out of memory, as you create the YAML document in memory first and then write it out. It is much better to stream it out directly.
You should also note that the code in the question you link to doesn't preserve the order of the original columns, something easily circumvented by using round_trip_dump instead of safe_dump.
You probably want to make a top-level sequence (list) as in the desired output of the linked question, with each element being a mapping (dict).
The following parses the CSV, taking the first line as keys for mappings created for each following line:
import sys
import csv
import ruamel.yaml as ry
import dateutil.parser # pip install python-dateutil
def process_line(line):
    """convert lines, trying, int, float, date"""
    ret_val = []
    for elem in line:
        try:
            res = int(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        try:
            res = float(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        try:
            res = dateutil.parser.parse(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        ret_val.append(elem.strip())
    return ret_val
csv_file_name = 'xyz.csv'
data = []
header = None
with open(csv_file_name) as inf:
    for line in csv.reader(inf):
        d = process_line(line)
        if header is None:
            header = d
            continue
        data.append(ry.comments.CommentedMap(zip(header, d)))
ry.round_trip_dump(data, sys.stdout, allow_unicode=True)
with input xyz.csv:
id, title_english, title_russian
1, A Title in English, Название на русском
2, Another Title, Другой Название
this generates:
- id: 1
  title_english: A Title in English
  title_russian: Название на русском
- id: 2
  title_english: Another Title
  title_russian: Другой Название
The process_line function is just some sugar that tries to convert strings in the CSV file to more useful types and to strings without leading spaces (resulting in far fewer quotes in your output YAML file).
I have tested the above on files with 1000 rows, without any problems (I won't post the output though).
The above was done using Python 3 as well as Python 2.7, starting with a UTF-8 encoded file xyz.csv. If you are using Python 2, you can try unicodecsv if you need to handle Unicode input and things don't work out as well as they did for me.
I have an ordered list of tuples, where each tuple contains a column name and value pair to be written to a CSV, for example:
lst = [('name','bob'),('age',19),('loc','LA')]
which has the info for bob: age 19 and location (loc) LA. I want to be able to write this to a CSV file based on the column names, and sometimes some of these columns are missing, for example in another row:
lst2 = [('name','bob'),('loc','LA')]
Here age is missing. How can I write these rows properly to a CSV in Python?
Those tuples can be used to initialize a dict so csv.DictWriter seems the best choice. In this example I create a dict filled with default values. For each list of tuples, I copy the dict, update with the known values and write it out.
import csv
# sample data
lst = [('name','bob'),('age',19),('loc','LA')]
lst2 = [('name','jane'),('loc','LA')]
lists = [lst, lst2]
# columns need some sort of default... I just guessed
defaults = {'name':'', 'age':-1, 'loc':'N/A'}
with open('output.csv', 'wb') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=sorted(defaults.keys()))
    writer.writeheader()
    for row_tuples in lists:
        # copy defaults then update with known values
        kv = defaults.copy()
        kv.update(row_tuples)
        writer.writerow(kv)
# debug...
print open('output.csv').read()
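If a single default for every column is acceptable, csv.DictWriter's restval parameter can fill in missing keys without a defaults dict. A Python 3 sketch under that assumption, writing to an in-memory buffer instead of output.csv and using 'N/A' as a guessed default:

```python
import csv
import io

lists = [
    [('name', 'bob'), ('age', 19), ('loc', 'LA')],
    [('name', 'jane'), ('loc', 'LA')],  # age is missing here
]

buffer = io.StringIO()  # stand-in for a file opened with newline=""
# restval is written for any fieldname a row dict does not contain
writer = csv.DictWriter(buffer, fieldnames=['name', 'age', 'loc'], restval='N/A')
writer.writeheader()
for row_tuples in lists:
    writer.writerow(dict(row_tuples))

output = buffer.getvalue()
```

The defaults-dict approach remains the better fit when each column needs its own default value.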
You should give more examples of what exactly is required. For instance, if the location is not given in lst2, what do you want to write to your CSV? From what I understand, you can make a function with default arguments:
import csv
def write_tuples_to_csv(name="DefaultName", age="DefaultAge", loc="Default location"):
    # appending to a file; the with block closes it when done
    with open("/path/to/csv/file", 'a') as f:
        writer = csv.writer(f)
        row = (name, age, loc)
        # note: this writes the header on every call; write it once elsewhere if that is not wanted
        writer.writerow(['name', 'age', 'location'])
        writer.writerow(row)
Now you can call this function for every item in the list. This should help get you started.
I've got a csv with:
T,8,101
T,10,102
T,5,103
and need to search the third column of the CSV file for my input and, if found, return the second column value in that same row (searching "102" would return "10"). I then need to save the result to use in another calculation. (I am just trying to print the result for now.) I am new to Python (2 weeks) and wanted to get a grasp on reading/writing CSV files. None of the search results I found gave me the answer I needed. Thanks
Here is my code:
name = input("waiting")
import csv
with open('cards.csv', 'rt') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        if row[2] == name:
            print(row[1])
As stated in my comment above I would implement a general approach without using the csv-module like this:
import io
s = """T,8,101
T,10,102
T,5,103"""
# use io.StringIO to get a file-like object
f = io.StringIO(s)
lines = [tuple(line.split(',')) for line in f.read().splitlines()]
example_look_up = '101'
def find_element(look_up):
    for t in lines:
        if t[2] == look_up:
            return t[1]
result = find_element(example_look_up)
print(result)
Please keep in mind that this is Python 3 code. If you are using Python 2, you need to replace print() with print and maybe change something related to the StringIO, which I am using here for demonstration purposes in order to get a file-like object. However, this snippet should give you a basic idea of a possible solution.
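For comparison, the same lookup can be sketched with the csv module, returning the value so it can be reused in a later calculation. The function name find_value and the follow-up calculation are assumptions for illustration; the sample rows are the ones from the question:

```python
import csv
import io


def find_value(csv_file, look_up):
    """Return the 2nd column of the first row whose 3rd column equals look_up."""
    for row in csv.reader(csv_file):
        # strip() guards against stray spaces around the stored value
        if len(row) >= 3 and row[2].strip() == look_up:
            return row[1]
    return None  # nothing matched


cards = io.StringIO("T,8,101\nT,10,102\nT,5,103\n")
result = find_value(cards, "102")
# csv values are strings, so convert before doing arithmetic
doubled = int(result) * 2
```

With a real file, cards would be replaced by open('cards.csv', newline='') inside a with block.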