Length of Python dictionary created doesn't match length from input file - python-2.7

I'm currently trying to create a dictionary from the following input file:
1776344_at 1779734_at 0.755332745 1.009570769 -0.497209846
1776344_at 1771911_at 0.931592828 0.830039019 2.28101445
1776344_at 1777458_at 0.746306282 0.753624146 3.709120716
...
...
There are a total of 12552 lines in this file.
What I wanted to do is to create a dictionary where the first 2 columns are the keys and the rest are the values. This I've successfully done and it looks something like this:
1770449_s_at;1777263_at:0.825723773;1.188969175;-2.858979578
1772892_at;1772051_at:-0.743866602;-1.303847456;26.41464414
1777227_at;1779218_s_at:0.819554413;0.677758609;4.51390617
But here's the thing: when I ran my Python script from the Windows cmd prompt, the generated output not only doesn't follow the order of the input file (e.g. its 1st line is the input's 34th line), but the whole file has only 739 lines.
Can someone enlighten me on what's going on? Is it something to do with memory? Because the last time I checked, I still had 305GB of disk space.
The script I wrote is as follows:
import sys
import os

input_file = sys.argv[1]
infile = open(input_file, 'r')
model_dict = {}
for line in infile:
    key = ';'.join(line.split('\t')[0:2]).rstrip(os.linesep)
    value = ';'.join(line.split('\t')[2:]).rstrip(os.linesep)
    print 'keys are:', key, '\n', 'values are:', value
    model_dict[key] = value
    print model_dict
    outfile = open('model_dict', 'w')
    for key, value in model_dict.items():
        print key, value
        outfile.write('%s:%s\n' % (key, value))
    outfile.close()

Based on the information given, and since each dictionary key is unique, I suspect your input file contains lines that generate the same key. The dictionary then only holds the last value associated with that key.
Python dictionaries are unordered collections of key: value pairs, so when you print their elements to the output file, don't expect the input order to be preserved.
Another problem I see in your script: the loop that writes the output file shouldn't be inside the loop that reads from the input file.
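To check that suspicion, here is a quick diagnostic sketch (reusing the question's tab-split key construction; it takes the same input file as its argument):

import sys
from collections import Counter

# Count how often each two-column key occurs in the input file;
# any count > 1 means that key gets overwritten in the dictionary.
input_file = sys.argv[1]
key_counts = Counter()
with open(input_file) as infile:
    for line in infile:
        key_counts[';'.join(line.split('\t')[0:2])] += 1

duplicates = dict((k, n) for k, n in key_counts.items() if n > 1)
print len(key_counts), 'unique keys,', len(duplicates), 'duplicated'

If the number of unique keys comes out at 739, duplicate keys fully explain the shorter output file.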

Related

Can't generate proper file in python

I'm trying to generate a new file based on an existing one containing only lines with some predefined text. I have:
with open("steps_shown_at_least_once.log", "r") as f:
for line in f:
if line.find("Run program"):
output = open('run_studio.txt', 'a')
output.write(line)
output.close()
For some reason this generates an identical file, even though the "Run program" text I'm searching for is not in every row of the old file.
line.find('Run program') returns the index of the string, not a boolean.
Return Value
Index if found and -1 otherwise.
Found here: Python String find() Method
Since -1 is truthy in Python, the condition is true for every line that does not contain the text (and false only when the match starts at index 0), which is why the output mirrors the input. Instead of line.find("Run program"): write if "Run program" in line:
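A corrected version of the loop might look like this (the output file is opened once, outside the loop; 'w' mode is an assumption so repeated runs don't accumulate lines, keep 'a' if appending really is intended):

with open("steps_shown_at_least_once.log", "r") as f, \
     open("run_studio.txt", "w") as output:
    for line in f:
        if "Run program" in line:  # membership test instead of find()
            output.write(line)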

Does csv.DictReader store file in memory?

I have to read a large CSV file of almost 100K rows, and it would be much easier to process if I could read each row in dictionary format.
After a little research I found Python's built-in csv.DictReader from the csv module.
But the documentation doesn't clearly mention whether it stores the whole file in memory or not.
It does mention that:
The fieldnames parameter is a sequence whose elements are associated with the fields of the input data in order.
But I'm not sure whether that sequence is stored in memory or not.
So the question is, does it store the whole file in memory?
If so, is there any other option to read a single row at a time from the file, generator-style, and get each row as a dict?
Here is my code:
def file_to_dictionary(self, file_path):
    """Read CSV rows as a dictionary """
    file_data_obj = {}
    try:
        self.log("Reading file: [{}]".format(file_path))
        if os.path.exists(file_path):
            file_data_obj = csv.DictReader(open(file_path, 'rU'))
        else:
            self.log("File does not exist: {}".format(file_path))
    except Exception as e:
        self.log("Failed to read file.", e, True)
    return file_data_obj
As far as I'm aware, the DictReader object you create, in your case file_data_obj, behaves like a generator: it does not store the whole file in memory but reads rows lazily, and it can only be iterated over once!
To print the fieldnames of your data as a list you can simply use: print file_data_obj.fieldnames
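To illustrate the lazy behaviour, here is a rough sketch of row-at-a-time processing; wrapping the reader in a generator function keeps the file open exactly as long as rows are being read (the file name 'data.csv' is just a placeholder):

import csv

def iter_rows(file_path):
    """Yield one row dict at a time; only one row is in memory at once."""
    with open(file_path, 'rU') as f:
        for row in csv.DictReader(f):
            yield row

rows = iter_rows('data.csv')
for row in rows:
    print row
# Like any generator, 'rows' is now exhausted and cannot be iterated again.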
Secondly, in my experience I find it much easier to use a list of dictionaries when reading data from csv files, where each dictionary represents a row in your file. Consider the following:
import csv

def csv_to_dict_list(path):
    csv_in = open(path, 'rb')
    reader = csv.DictReader(csv_in, restkey=None, restval=None, dialect='excel')
    fields = reader.fieldnames
    list_out = [row for row in reader]  # materialises every row in memory
    csv_in.close()
    return list_out, fields
Using the function above (or something similar), you can achieve your goal with a couple of lines. E.g.:
data, data_fields = csv_to_dict_list(path)
print data_fields  # prints fieldnames
print data[0]      # prints first row of data from file
Hope this helps!
Luke

Hello, I have code that prints what I need in Python, but I'd like it to write that result to a new file

The file looks like a series of lines with IDs:
aaaa
aass
asdd
adfg
aaaa
I'd like to get in a new file the ID and its occurrence in the old file as the form:
aaaa 2
asdd 1
aass 1
adfg 1
with the two elements separated by a tab.
The code I have prints what I want but doesn't write it to a new file:
with open("Only1ID.txt", "r") as file:
file = [item.lower().replace("\n", "") for item in file.readlines()]
for item in sorted(set(file)):
print item.title(), file.count(item)
As you use Python 2, the simplest approach to convert your console output to file output is by using the print chevron (>>) syntax which redirects the output to any file-like object:
with open("filename", "w") as f: # open a file in write mode
print >> f, "some data" # print 'into the file'
Your code could look like this after simply adding another open to open the output file and adding the chevron to your print statement:
with open("Only1ID.txt", "r") as file, open("output.txt", "w") as out_file:
file = [item.lower().replace("\n", "") for item in file.readlines()]
for item in sorted(set(file)):
print >> out_file item.title(), file.count(item)
However, your code has a few other issues which one should avoid or could improve:
Do not use the same variable name file for both the file object returned by open and your processed list of strings. This is confusing, just use two different names.
You can directly iterate over the file object, which works like a generator that returns the file's lines as strings. Generators process requests for the next element just in time: instead of first loading the whole file into memory like file.readlines() and processing it afterwards, only one line at a time is read and stored, whenever the next line is needed. That way you improve the code's performance and resource efficiency.
If you write a list comprehension but don't necessarily need its result as a list, because you simply want to iterate over it in a for loop, it's more efficient to use a generator expression (same effect as the file object's line generator described above). The only syntactic difference between a list comprehension and a generator expression are the brackets: replace [...] with (...) and you have a generator. The only downside of a generator is that you can neither find out its length nor access items directly using an index. As you don't need any of these features here, the generator is fine.
There is a simpler way to remove trailing newline characters from a line: line.rstrip() removes all trailing whitespaces. If you want to keep e.g. spaces, but only want the newline to be removed, pass that character as argument: line.rstrip("\n").
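For instance (results shown as comments):

line = "some text  \n"
line.rstrip()      # 'some text'   -- trailing spaces and newline removed
line.rstrip("\n")  # 'some text  ' -- only the newline removed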
However, it could possibly be even easier and faster to just not add another implicit line break during the print call instead of removing it first to have it re-added later. You would suppress the line break of print in Python 2 by simply adding a comma at the end of the statement:
print >> out_file, item.title(), file.count(item),
There is a type Counter to count occurrences of elements in a collection, which is faster and easier than writing it yourself, because you don't need the additional count() call for every element. The Counter behaves mostly like a dictionary with your items as keys and their count as values. Simply import it from the collections module and use it like this:
from collections import Counter

c = Counter(lines)
for item in c:
    print item, c[item]
With all those suggestions (except the one not to remove the line breaks) applied and the variables renamed to something more clear, the optimized code looks like this:
from collections import Counter

with open("Only1ID.txt") as in_file, open("output.txt", "w") as out_file:
    counter = Counter(line.lower().rstrip("\n") for line in in_file)
    for item in sorted(counter):
        print >> out_file, item.title(), counter[item]

Facing issue with for loop

I am trying to get this function to read an input file and output its lines into a new file. PyCharm keeps saying 'item' is not being used, or that it was already used in the first for loop. I don't see why 'item' is a problem. It also won't create the new file.
input_list = 'persist_output_input_file_test.txt'

def persist_output(input_list):
    input_file = open(input_list, 'rb')
    lines = input_file.readlines()
    input_file.close()
    for item in input_list:
        write_new_file = open('output_word.txt', 'wb')
        for item in lines:
            print>>input_list, item
        write_new_file.close()
You have a few things going wrong in your program.
input_list seems to be a string denoting the name of a file. Currently you are iterating over the characters in the string with for item in input_list.
You shadow the already created variable item in your second for loop. I recommend you change that.
In Python 2 the basic syntax for printing to the screen is print text, and in Python 3 it is print(text), unlike C++'s std::cout << text << endl;. << and >> are bitwise shift operators in Python expressions; Python 2's print statement does accept the chevron form print >> file_object, text, but input_list here is a string, not a file object, so print>>input_list, item fails.
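Putting those three points together, a minimal corrected sketch (assuming the goal is simply to copy every line of the input file) could look like this:

def persist_output(input_list):
    # input_list is the *name* of the input file, so open it
    # instead of iterating over the characters of the string
    with open(input_list, 'r') as input_file, \
         open('output_word.txt', 'w') as write_new_file:
        for item in input_file:  # one variable, one loop
            print item,          # trailing comma: line already ends with \n
            write_new_file.write(item)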
There are a few issues in your implementation. Refer to the following code for what you intend to do:
def persist_output(input_list):
    input_file = open(input_list, 'rb')
    lines = input_file.readlines()
    write_new_file = open('output_word.txt', 'wb')
    input_file.close()
    for item in lines:
        print item
        write_new_file.write(item)
    write_new_file.close()
The issues with your earlier implementation are as follows:
In the first loop you are iterating over the input file's name. If you intend to keep input_list as a list of input files to be read, then you will also have to open them. Right now, the loop iterates through the characters in the input file name.
You are opening the output file inside a loop, so only the last write operation will be successful. You would have to move the file opening operation outside the loop (ref: above code snippet) or change the mode to append. This can be done as follows:
write_new_file = open('output_word.txt', 'a')
There is a syntax error in the way you are using the print command.
f = open('yourfilename', 'r').read()
f1 = f.split('\n')
p = open('outputfilename', 'w')
for i in range(len(f1)):
    p.write(str(f1[i]) + '\n')
p.close()
hope this helps.

Merge CSV row with a string match from a 2nd CSV file

I'm working with two large files, approximately 100K+ rows each. I want to search csv file#2 for a string taken from csv file#1, then join another string from file#1 to the matched row in file#2. Here's an example of the data I'm working with and my expected output:
File#1: the string to be matched in file#2 is the 2nd element; the 1st is the integer to be appended to each matched row in file#2.
row 1:
3604430123,mta0000cadd503c.mta.net
row 2:
3604434567,mta0000CADD5638.MTA.NET
row 3:
3606304758,mta00069234e9a51.DT.COM
File#2:
row 1:
4246,211-015617,mta0000cadd503c.mta.net,old,NW MG2,BBand2 ESA,Active
row 2:
7251,ACCOUNT,mta0000CADD5638.MTA.NET,FQDN ,NW MG2,BBand2 ESA,Active
row 3:
536887946,874-22558501,mta00069234e9a51.DT.COM,"P",NW MG2,BBand2 ESA,Active
Desired Output joining bold integer string from file#1 to entire row in file#2 based on string match between file#1 and file#2:
row 1:
4246,211-015617,mta0000cadd503c.mta.net,old,NW MG2,BBand2 ESA,Active,3604430123
row 2:
7251,ACCOUNT,mta0000CADD5638.MTA.NET,FQDN ,NW MG2,BBand2 ESA,Active,3604434567
row 3:
536887946,874-22558501,mta00069234e9a51.DT.COM,"P",NW MG2,BBand2 ESA,Active,3606304758
There are many instances where the case of the match string in file#1 doesn't agree with the case in file#2 even though the characters match, so case can be ignored for the match criteria. The character case does need to be preserved in file#2 after it is appended with the integer string from file#1.
I'm a Python newb and I've been at this for a while, scouring posts on SE, but I can't seem to come up with working code that even gets me to the point where I can print out a line from file#2 that has been matched on the string in file#1. I've tried a few other methods, such as writing to a dictionary, using DictReader, etc., but I haven't been able to clear what appear to be simple errors in those methods. So I stripped this down to simple lists, aiming to use a list comprehension to combine the data into a list named output, which will eventually be written back to a csv file. Any help or suggestions would be greatly appreciated.
import csv

sg = []
fqdn = []
output = []

with open(r'file2.csv', 'rb') as src:
    read = csv.reader(src, delimiter=',')
    for row in read:
        sg.append(row)

with open(r'file1.csv', 'rb') as src1:
    read1 = csv.reader(src1, delimiter=',')
    for row in read1:
        fqdn.append(row)

output = output.append([s[0] for s in sg if fqdn[1] in sg])
print output
Result after running this is:
None
Process finished with exit code 0
You should use a dictionary for file#1 rather than just a list, as matching is easier. Just turn fqdn into a dict, and in your loop reading file#1, set your key-value pairs on the dict. I would use .lower() on the match key. This turns the key to lower case, so later you only have to check whether the lower-cased version of the field in file#2 is a key in the dictionary:
import csv

sg = []
fqdn = {}
output = []

with open(r'file2.csv', 'rb') as src:
    read = csv.reader(src, delimiter=',')
    for dataset in read:
        sg.append(dataset)

with open(r'file1.csv', 'rb') as src1:
    read1 = csv.reader(src1, delimiter=',')
    for to_append, to_match in read1:
        fqdn[to_match.lower()] = to_append

for dataset in sg:
    # If the key matched, to_append now contains the string to append,
    # else it becomes None
    to_append = fqdn.get(dataset[2].lower())
    if to_append:
        dataset.append(to_append)  # Append the field
        output.append(dataset)     # Append the row to the result list

print(output)
You can then use csv.writer to create a csv file from the result.
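For example (a minimal sketch; 'merged.csv' is just a placeholder name):

with open('merged.csv', 'wb') as dst:  # 'wb' for csv output in Python 2
    writer = csv.writer(dst)
    writer.writerows(output)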
Here's a brute-force solution to this problem. For every line of the first file, you search through every line of the second file until you find a match. The matched lines are written to the output.csv file in the format you specified, using the csv writer.
import csv

with open('file1.csv', 'r') as file1:
    with open('file2.csv', 'r') as file2:
        with open('output.csv', 'w') as outfile:
            writer = csv.writer(outfile)
            reader1 = csv.reader(file1)
            reader2 = csv.reader(file2)
            for row in reader1:
                if not row:
                    continue
                for other_row in reader2:
                    if not other_row:
                        continue
                    # if we found a match, let's write it to the csv file
                    # with the id appended
                    if row[1].lower() == other_row[2].lower():
                        new_row = other_row
                        new_row.append(row[0])
                        writer.writerow(new_row)
                        continue
                # reset file pointer to beginning of file
                file2.seek(0)
You might be tempted to store the information in a data structure before writing it out to a file. In my experience, you always end up getting larger files in the future and may run into memory issues. I like to write things out to file as I find the matches in order to avoid this problem.