Why must I run this code a few times before my entire .csv file is converted into a .yaml file? - python-2.7

I am trying to build a tool that can convert .csv files into .yaml files for further use. I found a handy bit of code that does the job nicely from the link below:
Convert CSV to YAML, with Unicode?
which states that the line will take the dict created by opening a .csv file and dump it to a .yaml file:
out_file.write(ry.safe_dump(dict_example,allow_unicode=True))
However, one small kink I have noticed is that when it is run once, the generated .yaml file is typically incomplete by a line or two. In order to have the .csv file exhaustively read through to create a complete .yaml file, the code must be run two or even three times. Does anybody know why this could be?
UPDATE
Per request, here is the code I use to parse my .csv file, which is two columns long (with a string in the first column and a list of two strings in the second column), and will typically be 50 rows long (or maybe more). Also note that it is designed to remove any '\n' or spaces that could potentially cause problems later on in the code.
import csv

csv_contents = {}
with open("example1.csv", "rU") as csvfile:
    green = csv.reader(csvfile, dialect='excel')
    for line in green:
        candidate_number = line[0]
        # strip spaces and line breaks embedded in the sequence fields
        first_sequence = line[1].replace(' ', '').replace('\r', '').replace('\n', '')
        second_sequence = line[2].replace(' ', '').replace('\r', '').replace('\n', '')
        csv_contents[candidate_number] = [first_sequence, second_sequence]
# drop the header row if it was read as data
csv_contents.pop('Header name', None)
Ultimately, it is not that important that I maintain the order of the rows from the original dict, just that all the information within the rows is properly structured.

I am not sure what the cause could be, but you might be running out of memory, since you create the YAML document in memory first and then write it out. It is much better to stream it out directly.
You should also note that the code in the question you link to doesn't preserve the order of the original columns, which is easily fixed by using round_trip_dump instead of safe_dump.
You probably want to make a top-level sequence (list) as in the desired output of the linked question, with each element being a mapping (dict).
The following parses the CSV, taking the first line as keys for mappings created for each following line:
import sys
import csv
import ruamel.yaml as ry
import dateutil.parser  # pip install python-dateutil

def process_line(line):
    """convert lines, trying, int, float, date"""
    ret_val = []
    for elem in line:
        try:
            res = int(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        try:
            res = float(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        try:
            res = dateutil.parser.parse(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        ret_val.append(elem.strip())
    return ret_val

csv_file_name = 'xyz.csv'
data = []
header = None
with open(csv_file_name) as inf:
    for line in csv.reader(inf):
        d = process_line(line)
        if header is None:
            header = d
            continue
        data.append(ry.comments.CommentedMap(zip(header, d)))
ry.round_trip_dump(data, sys.stdout, allow_unicode=True)
with input xyz.csv:
id, title_english, title_russian
1, A Title in English, Название на русском
2, Another Title, Другой Название
this generates:
- id: 1
  title_english: A Title in English
  title_russian: Название на русском
- id: 2
  title_english: Another Title
  title_russian: Другой Название
The process_line function is just some sugar that tries to convert strings in the CSV file to more useful types, and to strings without leading spaces (resulting in far fewer quotes in your output YAML file).
I have tested the above on files with 1000 rows, without any problems (I won't post the output though).
The above was done using Python 3 as well as Python 2.7, starting with a UTF-8 encoded file xyz.csv. If you are using Python 2, you can try unicodecsv if you need to handle Unicode input and things don't work out as well as they did for me.

Related

Null Byte appending while reading the file through Python pandas

I have created a script that finds the matching rows between two files. After that, I pass the output file to a function, which uses the file as input to create a pivot using pandas.
But somehow, something seems to be wrong, below is the code snippet
def CreateSummary(file):
    out_file = file
    file_df = pd.read_csv(out_file)  ## This function is appending NULL Bytes at the end of the file
    #print file_df.head(2)
The above code is giving me the error as
ValueError: No columns to parse from file
Tried another approach:
file_df = pd.read_csv(out_file,delim_whitespace=True,engine='python')
##This gives me error as
_csv.Error: line contains NULL byte
Any suggestions and criticism are highly appreciated.

Parse CSV efficiently in python

I am writing a CSV parser which has following structure
class decode:
    def __init__(self):
        self.fd = open('test.csv')
    def decodeoperation(self):
        for row in self.fd:
            cmd = self.decodecmd(row)
            if cmd == 'A':
                self.decodeAopt()
            elif cmd == 'B':
                self.decodeBopt()
    def decodeAopt(self):
        for row in self.fd:
            #decodefurther dependencies based on cmd A till
            #a condition occurs on any further row
            return
    def decodeBopt(self):
        for row in self.fd:
            #decodefurther dependencies based on cmd B till
            #a condition occurs on any further row
            return
The current code works fine for me, but I am not comfortable iterating through the CSV file in every method. Could it be done in a better way?
There is nothing inherently wrong with using a common iterator across multiple methods, as long as you can determine in advance which method to dispatch to at any given point in the sequence (which you are doing by decoding the cmd from the row and getting 'A', 'B', etc.). The design has issues if you have to read several items before you could determine which method to call, and might have to back up if you picked the wrong method and needed to try another. In parsing, this is called backtracking. Since you are passing around a file object, backing up is difficult. Note that your separate decoder methods will have to know when to stop before reading the next row that contains a command, so they will need some sort of terminating sentinel row that they can recognize.
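As a rough illustration of the shared-iterator idea with a terminating sentinel, here is a minimal sketch. The class name Decoder, the 'END' sentinel value, and the decode_block helper are all made up for this example; the real rows and stop condition will differ.

```python
class Decoder(object):
    """Dispatch on a command row, then let a helper consume rows up to a sentinel."""

    END = 'END'  # hypothetical sentinel row that closes each command block

    def __init__(self, rows):
        # iter() gives one shared iterator, even when rows is a plain list
        self.rows = iter(rows)
        self.decoded = []

    def decode(self):
        for row in self.rows:
            if row in ('A', 'B'):
                self.decode_block(row)
        return self.decoded

    def decode_block(self, cmd):
        # Continues on the same iterator, so rows consumed here
        # are never seen again by decode().
        payload = []
        for row in self.rows:
            if row == self.END:
                break
            payload.append(row)
        self.decoded.append((cmd, payload))


rows = ['A', 'a1', 'a2', 'END', 'B', 'b1', 'END']
result = Decoder(rows).decode()
print(result)
```

Because the helper stops exactly at the sentinel, the outer loop resumes on the next command row and nothing has to be pushed back.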
Some general comments on your Python and class design:
You have a nice simple if-elif-elif dispatch table that can translate to a Python dict like this:
# put this code in place of your "if cmd == ... elif elif elif..." code
dispatch = {
    # note - no ()'s, we just want to reference the methods, not call them
    'A': self.decodeAopt,
    'B': self.decodeBopt,
    'C': self.decodeCopt,
    # look how easy it is to add more decoders
}
# lookup which decoder to use for the current cmd
decoder = dispatch[cmd]
# run it
decoder()
# or do it all in one line
dispatch[cmd]()
Instead of having your __init__ method open a file, let it accept an iterator object. This will make it much easier to write tests for your object, since you'll be able to pass simple Python lists containing CSV rows.
class decode:
    def __init__(self, sequence):
        self.fd = sequence
You might want to rename this var from 'fd' to something like 'seq', since it doesn't have to be a file, but could be any iterable that gives you decodable rows.
If you are doing your own CSV parsing, look at using the builtin csv module. It will do quite a bit of work for you, like parsing quoted strings that could contain commas, and can give you easy-to-work-with dicts for each row, given headers read from the input file, or specified by you. If you have modified __init__ as I suggested, you can use it like:
import csv
# assuming test.csv has a header row
reader = csv.DictReader(open('test.csv'))
# or specify headers if not - I encourage you to give these columns better names
reader.fieldnames = ['cmd', 'val1', 'val2', 'val3']
decoder = decode(reader)
decoder.decodeoperation()
Then you can write in decodeoperation:
cmd = row['cmd']
Note that this would impart a slightly different design to your class, that it would expect to be given a sequence of dicts, rather than a sequence of strings.
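Putting the suggestions together, a sketch of the dict-based design might look like the following. The method names decode_a and decode_b and the sample data are invented for illustration, and a list of lines stands in for open('test.csv').

```python
import csv


class Decode(object):
    def __init__(self, sequence):
        # any iterable of dict rows will do, not just a file-backed reader
        self.seq = sequence
        self.log = []
        # method references, not calls
        self.dispatch = {'A': self.decode_a, 'B': self.decode_b}

    def decodeoperation(self):
        for row in self.seq:
            self.dispatch[row['cmd']](row)
        return self.log

    def decode_a(self, row):
        self.log.append('A got ' + row['val1'])

    def decode_b(self, row):
        self.log.append('B got ' + row['val1'])


# The first line is the header row that DictReader turns into keys.
lines = ['cmd,val1', 'A,10', 'B,20']
log = Decode(csv.DictReader(lines)).decodeoperation()
print(log)
```

In a test, the DictReader can be replaced by a plain list of dicts such as [{'cmd': 'A', 'val1': '10'}], which is the main benefit of passing the sequence in.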

'~' leading to null results in python script

I am trying to extract a dynamic value (static characters) from a csv file in a specific column and output the value to another csv.
The data element I am trying to extract is '12385730561818101591' from the value 'callback=B~12385730561818101591' located in a specific column.
I have written the below python script, but the output results are always blank. The regex '=(~[0-9]+)' was validated to successfully pull out the '12385730561818101591' value. This was tested on www.regex101.com.
When I use this in Python, no results are displayed in the output file. I have a feeling the '~' is causing the error. When I tried searching for '~' in the original CSV file, no results were found, but it is there!
Can the community help me with the following:
(1) Determine root cause of no output and validate if '~' is the problem. Could the problem also be the way I'm splitting the rows? I'm not sure if the rows should be split by ';' instead of ','.
import csv
import sys
import ast
import re
filename1 = open("example.csv", "w")
with open('example1.csv') as csvfile:
    data = None
    patterns = '=(~[0-9]+)'
    data1= csv.reader(csvfile)
    for row in data1:
        var1 = row[57]
        for item in var1.split(','):
            if re.search(patterns, item):
                for data in item:
                    if 'common' in data:
                        filename1.write(data + '\n')
filename1.close()
Here I have tried to write sample code. Hope this will help you in solving the problem:
import re
str="callback=B~12385730561818101591"
rc=re.match(r'.*=B\~([0-9A-Ba-b]+)', str)
print rc.group(1)
Your regex is wrong for your example:
=(~[0-9]+) will never match callback=B~12385730561818101591 because of the B after the = and before the ~.
You also include the ~ in the capturing group.
I'm not exactly sure what your goal is, but this could work. Give more details if you have more restrictions.
=.+~([0-9]+)
EDIT
Following the new provided information :
patterns = '=.+~([0-9]+)'
...
result = re.search(patterns, item)
if result:
    # group(1) is the captured number; group(0) would be the whole match
    number = result.group(1)
    filename1.write(number + '\n')
...
Concerning splitting your line on \t (tab), you should show an example of the full line.
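To check the two patterns against the sample value from the question, a small sketch (the variable names are only for illustration):

```python
import re

value = 'callback=B~12385730561818101591'

# The original pattern expects '~' right after '=', but 'B' sits in between.
old_match = re.search(r'=(~[0-9]+)', value)
print(old_match)

# The suggested pattern allows characters between '=' and '~'
# and keeps '~' outside the capturing group.
match = re.search(r'=.+~([0-9]+)', value)
number = match.group(1)   # only the captured digits
whole = match.group(0)    # the entire matched text, including '=B~'
print(number)
print(whole)
```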

Does csv.DictReader store file in memory?

I have to read a large CSV file of almost 100K rows, and it would be much easier to process if I could read each row as a dictionary.
After a little research I found csv.DictReader in Python's built-in csv module.
But the documentation does not clearly state whether it stores the whole file in memory or not.
But it has mentioned that:
The fieldnames parameter is a sequence whose elements are associated with the fields of the input data in order.
But I'm not sure whether sequence is stored in memory or not.
So the question is, does it store whole file in the memory?
If so, is there any other option to read a single row at a time from the file, as with a generator expression, and get each row as a dict?
Here is my code:
def file_to_dictionary(self, file_path):
    """Read CSV rows as a dictionary """
    file_data_obj ={}
    try:
        self.log("Reading file: [{}]".format(file_path))
        if os.path.exists(file_path):
            file_data_obj = csv.DictReader(open(file_path, 'rU'))
        else:
            self.log("File does not exist: {}".format(file_path))
    except Exception as e:
        self.log("Failed to read file.", e, True)
    return file_data_obj
As far as I'm aware, the DictReader object you create, in your case file_data_obj, is an iterator that behaves much like a generator.
It reads rows lazily, one at a time, so the whole file is not held in memory, but it can only be iterated over once!
To print the fieldnames of your data as a list you can simply use: print file_data_obj.fieldnames
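The lazy, single-pass behaviour can be observed with a small sketch. The source() generator and its sample rows are made up here and stand in for an open file; it records each line as the reader pulls it.

```python
import csv

pulled = []  # lines the reader has actually requested so far

def source():
    # hypothetical stand-in for an open CSV file
    for line in ['name,score', 'ann,1', 'bob,2']:
        pulled.append(line)
        yield line

reader = csv.DictReader(source())
first = next(reader)               # reads the header plus one data row
pulled_after_first = list(pulled)  # snapshot: 'bob,2' has not been read yet

rest = list(reader)                # consumes the remaining rows
again = list(reader)               # the reader is exhausted, so this is empty

print(pulled_after_first)
print(dict(first))
print(len(rest), again)
```

To go through the data a second time, the file has to be reopened (or rewound) and a new DictReader created.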
Secondly, in my experience I find it much easier to use a list of dictionaries when reading data from csv files, where each dictionary represents a row in your file. Consider the following:
def csv_to_dict_list(path):
    # 'rb' is the mode the csv module expects on Python 2
    with open(path, 'rb') as csv_in:
        reader = csv.DictReader(csv_in, restkey=None, restval=None, dialect='excel')
        fields = reader.fieldnames
        # this loads every row into memory
        list_out = [row for row in reader]
    return list_out, fields
Using the function above (or something similar), you can achieve your goal with a couple of lines. E.g.:
data, data_fields = csv_to_dict_list(path)
print data_fields  # prints fieldnames
print data[0]  # prints first row of data from file
Hope this helps!
Luke

Removing Duplicate Lines by Title Only

I am trying to modify a script so that it will remove duplicate lines from a text file using only the title portion of that line.
To clarify the text file lines look something like this:
Title|Image Url|Description|Page Url
At the moment the script does remove duplicates, but it does so by reading the entire line, not just the first part. All the lines in the file are not going to be 100% the same, but a few will be very similar.
I want to remove all of the lines that contain the same "title", regardless of what the rest of the line contains.
This is the script I am working with:
import sys
from collections import OrderedDict
infile = "testfile.txt"
outfile = "outfile.txt"
inf = open(infile,"r")
lines = inf.readlines()
inf.close()
newset = list(OrderedDict.fromkeys(lines))
outf = open(outfile,"w")
lstline = len(newset)
for i in range(0,lstline):
    ln = newset[i]
    outf.write(ln)
outf.close()
So far I have tried using .split() to split the lines in the list. I have also tried .readline(lines[0:25]) in hopes of using a character limit to achieve the desired results, but no luck so far. I also can't seem to find any documentation on my exact problem so I'm stuck.
I am using Windows 8 and Python 2.7.9 for this project if that helps.
I made a few changes to the program you had set up. First, I changed your file interactions to use "with" statements, since those are very convenient and automatically handle a lot of the functionality you had to write out. Second off, I used a set instead of an OrderedDict because you were basically just trying to emulate set functionality (exclusivity of elements) by using keys in an OrderedDict. If the title hasn't been used, it adds it to the set so it can't be used again and prints the line to the output file. If it has been used, it keeps going. I hope this helps you!
with open("testfile.txt") as infile:
    with open("outfile.txt",'w') as outfile:
        titleset = set()
        for line in infile:
            title = line.split('|')[0]
            if title not in titleset:
                titleset.add(title)
                outfile.write(line)
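The same logic can be wrapped in a small generator so it can be tried on an in-memory list before touching real files. The function name dedupe_by_title and the sample lines below are invented for this sketch.

```python
def dedupe_by_title(lines, sep='|'):
    """Yield each line whose title (first field) has not been seen before."""
    seen = set()
    for line in lines:
        title = line.split(sep)[0]
        if title not in seen:
            seen.add(title)
            yield line


sample = [
    'Cats|a.png|About cats|/cats\n',
    'Dogs|d.png|About dogs|/dogs\n',
    'Cats|b.png|More about cats|/cats2\n',
]
kept = list(dedupe_by_title(sample))
print(kept)
```

Only the first line for each title is kept, and the original order is preserved. With real files, outfile.writelines(dedupe_by_title(infile)) would write the result.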