csv reader failing on new line without delimiter - django

I am building an application which accepts csv data as an uploaded file.
Here is the relevent part of my view:
def climate_upload(request):
...
reader = csv.reader(file, delimiter=';') # news csv reader instance
next(reader) # skip header row
for line in reader:
if not line:
continue
report = site_name
report.year = line[1]
report.month = line[2]
...
report.save()
file.close() # close file
...
So, this works fine on data which looks like this:
;"headers"
;2012;5;2012-06-01;27.1;24.5;29.8;26.8;85;0.8
;2012;5;2012-06-02;27.1;24.5;29.8;26.8;85;0.8
But fails on this:
"headers"
2012;5;2012-06-01;27.1;24.5;29.8;26.8;85;0.8
2012;5;2012-06-02;27.1;24.5;29.8;26.8;85;0.8
Note the lack of initial delimiter on each line.
Unfortunately MS Excel seems to spit out the second version. I assume that reader is not recognizing a new line as a delimiter. Is there some flag with reader which will force it to accept \n as a delimiter as well as ; ?
Any help much appreciated.

The delimiters or newlines aren't the problem - you're counting incorrectly.
The first element of a list has the index 0. So it should be
report.year = line[0]
report.month = line[1]
# etc.
I'm guessing you're running into a List index out of range exception on the last element (line[9]).

Related

How to manipulate multiple csv files with raw data starting from different row for each file?

I would like to format multiplecsv files, some of them have summaries before the raw data. Raw data can start at any row, but if “colname” is find at any row then raw data start there. I am using the Standard Libary csv module to read files and check if “colname” exist and extract the data from there. With the code below, print(data) always gives me data from the first row of the file. But I want to pull the data starting from where “colname” is found. If “colname” is not found I don’t want to read the data.
Root_dir=r”folder1”
for fname in os.listdir(root_dir):
file_path = os.path.join(root_dir, fname)
if fname.endswith(('.csv')):
n = 0
with open(file_path,'rU') as fp:
csv_reader = csv.reader(fp)
while True:
for line in csv_reader:
if line == " colname": continue
n = n + 1
data=line
print(data)
Your code's logic reads only skip lines that aren't exactly " colname", which has 2 problems:
You want to skip lines until AFTER you have seen "colname"; you could use a boolean variable to distinguish between these two situations
Not clear if your test for colname is correct; for example, if there isn't exactly one leading space, or the line has a trailing end-of-line character, would trip it up.

'~' leading to null results in python script

I am trying to extract a dynamic value (static characters) from a csv file in a specific column and output the value to another csv.
The data element I am trying to extract is '12385730561818101591' from the value 'callback=B~12385730561818101591' located in a specific column.
I have written the below python script, but the output results are always blank. The regex '=(~[0-9]+)' was validated to successfully pull out the '12385730561818101591' value. This was tested on www.regex101.com.
When I use this in Python, no results are displayed in the output file. I have a feeling the '~' is causing the error. When I tried searching for '~' in the original CSV file, no results were found, but it is there!
Can the community help me with the following:
(1) Determine root cause of no output and validate if '~' is the problem. Could the problem also be the way I'm splitting the rows? I'm not sure if the rows should be split by ';' instead of ','.
import csv
import sys
import ast
import re
filename1 = open("example.csv", "w")
with open('example1.csv') as csvfile:
data = None
patterns = '=(~[0-9]+)'
data1= csv.reader(csvfile)
for row in data1:
var1 = row[57]
for item in var1.split(','):
if re.search(patterns, item):
for data in item:
if 'common' in data:
filename1.write(data + '\n')
filename1.close()
Here I have tried to write sample code. Hope this will help you in solving the problem:
import re
str="callback=B~12385730561818101591"
rc=re.match(r'.*=B\~([0-9A-Ba-b]+)', str)
print rc.group(1)
You regex is wrong for your example :
=(~[0-9]+) will never match callback=B~12385730561818101591 because of the B after the = and before the ~.
Also you include the ~ in the capturing group.
Not exatly sure what's your goal but this could work. Give more details if you have more restrictions.
=.+~([0-9]+)
EDIT
Following the new provided information :
patterns = '=.+~([0-9]+)'
...
result = re.search(patterns, item):
number = result.group(0)
filename1.write(number + '\n')
...
Concerning your line split on the \t (tabulation) you should show an example of the full line

Python insert characters into txt file without overwrite

I want to insert my Random()'s return value into txt file without overwrite ('a') and to a specific location, like at the sixt character, but when I execute this, Random is insert to the third line.
`def Modif_Files(p_folder_path):
Tab = []
for v_root, v_dir, v_files in os.walk(p_folder_path):
print v_files
for v_file in v_files:
file = os.path.join(p_folder_path, v_file)
#with open(file, 'r') as files:
#for lines in files.readlines():
#Tab.append([lines])
with open(file, 'a') as file:
file.write("\n add " + str(Random())) #Random = int
#file.close
def Random():
global last
last = last + 3 + last * last * last * last % 256
return last
def main ():
Modif_Files(Modif_Path, 5) # Put path with a txt file inside
if __name__ == '__main__':
main()
`
After going through few other posts, it seems it is not possible to write in the middle of beginning of a file directly without overwriting. To write in the middle you need to copy or read everything after the position where you want to insert. Then after inserting append the content you read to the file.
Source: How do I modify a text file in Python?
Okay, I found the solution ; with open(file, 'r+') as file:
r+ and it work like a charm :)
The given answer is incorrect and/or lacking significant detail. At the time of this question maybe it wasn't, but currently writing to specific positions within a file using Python IS possible. I came across this question and answer in my search for this exact issue - an update could be useful for others.
See below for a resolution.
def main():
file = open("test.txt", "rb")
filePos = 0
while True:
# Read the file character by character
char = file.read(1)
# When we find the char we want, break the loop and save the read/write head position.
# Since we're in binary, we need to decode to get it to proper format for comparison (or encode the char)
if char.decode('ascii') == "*":
filePos = file.tell()
break
# If no more characters, we're at the end of the file. Break the loop and end the program.
elif not char:
break
# Resolve open/unneeded file pointers.
file.close()
# Open the file in rb+ mode for writing without overwriting (appending).
fileWrite = open("test.txt", 'rb+')
# Move the read/write head to the location we found our char at.
fileWrite.seek(filePos - 1)
# Overwrite our char.
fileWrite.write(bytes("?", "ascii"))
# Close the file
fileWrite.close()
if __name__ == "__main__":
main()

Merge CSV row with a string match from a 2nd CSV file

I'm working with two large files; approximately 100K+ rows each and I want to search csv file #1 for a string contained in csv file#2, then join another string from csv file#1 to the row in csv file#2 based on the match criteria. Here's an example of the data I'm working with and my expected output:
File#1: String to be matched in file#2 is the 2nd element; 1st is to be appended to each matched row in file#2. (Integer to be appended is bold; string to be matched is italicized for clarity only)
row 1:
3604430123,mta0000cadd503c.mta.net
row 2:
3604434567,mta0000CADD5638.MTA.NET
row 3:
3606304758,mta00069234e9a51.DT.COM
File#2:
row 1:
4246,211-015617,mta0000cadd503c.mta.net,old,NW MG2,BBand2 ESA,Active
row 2:
7251,ACCOUNT,mta0000CADD5638.MTA.NET,FQDN ,NW MG2,BBand2 ESA,Active
row 3:
536887946,874-22558501,mta00069234e9a51.DT.COM,"P",NW MG2,BBand2 ESA,Active
Desired Output joining bold integer string from file#1 to entire row in file#2 based on string match between file#1 and file#2:
row 1:
4246,211-015617,mta0000cadd503c.mta.net,old,NW MG2,BBand2 ESA,Active,3604430123
row 2:
7251,ACCOUNT,mta0000CADD5638.MTA.NET,FQDN ,NW MG2,BBand2 ESA,Active,3604434567
row 3:
536887946,874-22558501,mta00069234e9a51.DT.COM,"P",NW MG2,BBand2 ESA,Active,3606304758
There are many instances where the case in the match string of file#1 doesn't match the case of file#2, however the characters match, thus case can be ignored for match critera. The character case does need to be preserved in file#2 after it is appended with the integer string from file#1.
I'm a python newb and I've been at this for a while and have scoured posts in SE, but can't seem to come up with working code that gets me to the point where I can just print out a line from file#2 that has been matched on the string in file#1. I've tried a few other methods, such as writing to a dictionary, using Dictreader, etc, but haven't been able to clear what appears to be simple errors in those methods, so I tried to strip this down to simple lists and get to the point where I can use a list comprehension to combine the data, then write that back to a file named output, which will eventually be written back to a csv file. Any help or suggestions would be greatly appreciated.
import csv
sg = []
fqdn = []
output = []
with open(r'file2.csv', 'rb') as src:
read = csv.reader(src, delimiter=',')
for row in read:
sg.append(row)
with open(r'file1.csv', 'rb') as src1:
read1 = csv.reader(src1, delimiter=',')
for row in read1:
fqdn.append(row)
output = output.append([s[0] for s in sg if fqdn[1] in sg])
print output
Result after running this is:
None
Process finished with exit code 0
You should use a dictionary for file#1 than just a list, as matching is easier. Just turn fqdn into a dict and in your loop reading file#1 set your key-value pairs on the dict. I would use .lower() on the match key. This turns the key to lower case so you later only have to check if the lower-cased version of the field in file#2 is a key in the dictionary:
import csv
sg = []
fqdn = {}
output = []
with open(r'file2.csv', 'rb') as src:
read = csv.reader(src, delimiter=',')
for dataset in read:
sg.append(dataset)
with open(r'file1.csv', 'rb') as src1:
read1 = csv.reader(src1, delimiter=',')
for to_append, to_match in read1:
fqdn[to_match.lower()] = to_append
for dataset in sg:
to_append = fqdn.get(dataset[2].lower()) # If the key matched, to_append now contains the string to append, else it becomes None
if to_append:
dataset.append(to_append) # Append the field
output.append(dataset) # Append the row to the result list
print(output)
You can then use csv.writer to create a csv file from the result.
Here's a brute force solution to solving this problem. For every line of the first file, you will search through every line of the second file until you find a match. The matched lines will be written out to the output.csv file in the format you specified using the csv writer.
import csv
with open('file1.csv', 'r') as file1:
with open('file2.csv', 'r') as file2:
with open('output.csv', 'w') as outfile:
writer = csv.writer(outfile)
reader1 = csv.reader(file1)
reader2 = csv.reader(file2)
for row in reader1:
if not row:
continue
for other_row in reader2:
if not other_row:
continue
# if we found a match, let's write it to the csv file with the id appended
if row[1].lower() == other_row[2].lower():
new_row = other_row
new_row.append(row[0])
writer.writerow(new_row)
continue
# reset file pointer to beginning of file
file2.seek(0)
You might be tempted to store the information in a data structure before writing it out to a file. In my experience, you always end up getting larger files in the future and may run into memory issues. I like to write things out to file as I find the matches in order to avoid this problem.

how to improve the speed of the python script

I'm very new to python. I'm working in the area of hydrology and I want to learn python to assist me with processing hydrological data.
At the moment I write a script to extract bits of information from a big data set. I have three csv files:
Complete_borelist.csv
Borelist_not_interested.csv
Elevation_info.csv
I want to create a file with has all the bores that are in complete_borelist.csv but not in borelist_not_interested.csv. I also want to grab some information from complete_borelist.csv and Elevation_info.csv for those bores which satisfy the first criteria.
My script is as follow:
not_interested_list =[]
outfile1 = open('output.csv','w')
outfile1.write('Station_ID,Name,Easting,Northing,Location_name,Elevation')
outfile1.write('\n')
with open ('Borelist_not_interested.csv','r') as f1:
for line in f1:
if not line.startswith('Station'): #ignore header
line = line.rstrip()
words = line.split(',')
station = words[0]
not_interested_list.append(station)
with open('Complete_borelist.csv','r') as f2:
next(f2) #ignore header
for line in f2:
line= line.rstrip()
words = line.split(',')
station = words[0]
if not station in not_interested_list:
loc_name = words[1]
easting = words[4]
northing = words[5]
outfile1.write(station+','+easting+','+northing+','+loc_name+',')
with open ('Elevation_info.csv','r') as f3:
next(f3) #ignore header
for line in f3:
line = line.rstrip()
data = line.split(',')
bore_id = data[0]
if bore_id == station:
elevation = data[4]
outfile1.write(elevation)
outfile1.write ('\n')
outfile1.close()
I have two issues with the script:
The first is the Elevation_info.csv doesn't have information for all the bore in the Complete_borelist.csv. When my loop get to the station where it can't find Elevation record for it, the script doesn't write "null" but continue to write the information for the next station in the same line. Can anyone help me to fix this please?
The second is my complete borelist is about >200000 rows and my script runs through them very slow. Can anyone have any suggestion to make it run faster?
Very much appreciated and sorry if my question is too long.
performance-wise, this has a couple of problems. The first one is that you are opening and re-reading the Elevation info for every line of the complete file.. Read the elevation info into a dictionary keyed upon the bore_id before you open the complete file. Then you can test the dictionary very fast to see if station is in it instead of re-reading.
The second performance issue is that you don't stop searching in the bore_id list once you find a match. The dictionary idea solves that too, but otherwise a break once you have a match would help a little.
For the null printing problem, you just need to outfile1.write("\n") if the bore_id is not in the dictionary. An else statement on the dictionary test does that. In the current code, an else closing the for loop would do it. Or even changing the indentation of that last write("\n").