Python dictionary construction from file with multiple similar values and keys - list

I am new to python (well to coding in general) and am trying to use it to analyze some data at work. I have a file like this:
HWI-ST591_0064:5:1101:1228:2111#0/1 + 7included 11 A>G - -
HWI-ST591_0064:5:1101:1205:2125#0/1 + genomic 17 A>G - -
HWI-ST591_0064:5:1101:1178:2129#0/1 + 7included 6 A>C 8 A>T
HWI-ST591_0064:5:1101:1176:2164#0/1 + 7included 6 A>T 8 A>G
HWI-ST591_0064:5:1101:1199:2234#0/1 + 7included 14 T>C 21 G>A
HWI-ST591_0064:5:1101:1208:2249#0/1 + 7included 32 C>T - -
It is tab delimited. I am trying to create a dictionary whose keys are the last four values of each line joined together, and whose values are lists of the first field of every matching line (a unique identifier), like this:
{'32C>T--': ['HWI-ST591_0064:5:1101:1208:2249#0/1'],
'6A>C8A>C': ['HWI-ST591_0064:5:1101:1318:2090#0/1'],
'36A>G--': ['HWI-ST591_0064:5:1101:1425:2093#0/1'],
'----': ['HWI-ST591_0064:5:1101:1222:2225#0/1'],
'6A>C8A>T': ['HWI-ST591_0064:5:1101:1178:2129#0/1','HWI-ST591_0064:5:1101:1176:2164#0/1']}
This way I can then get a list of the unique identifiers and count, sort, or do the other things I need to do. I can build the dictionary, but when I try to write it out to a file I get an error. I think the problem is that the values are lists; I keep getting the error
File "trial.py", line 33, in
outFile.write("%s\t%s\n" % ('\t' .join(key, mutReadDict[key])))
TypeError: unhashable type: 'list'
Is there a way to make this work so I can write it to a file? I tried .iteritems() in the for loop that builds the dictionary, but that didn't seem to work. Thanks, and here is my code:
inFile = open('path', 'rU')
outFile = open('path', 'w')
from collections import defaultdict
mutReadDict = defaultdict(list)
for line in inFile:
    entry = line.strip('\n').split('\t')
    fastQ_ID = entry[0]
    strand = entry[1]
    chromosome = entry[2]
    mut1pos = entry[3]
    mut1base = entry[4]
    mut2pos = entry[5]
    mut2base = entry[6]
    mutKey = mut1pos + mut1base + mut2pos + mut2base
    if chromosome == '7included':
        mutReadDict[mutKey].append(fastQ_ID)
    else:
        pass
keyList = [mutReadDict.keys()]
keyList.sort()
for key in keyList:
    outFile.write("%s\t%s\n" % ('\t' .join(key, mutReadDict[key])))
outFile.close()

I think you want:
keyList = mutReadDict.keys()
instead of
keyList = [mutReadDict.keys()]
You probably mean this too:
for key in keyList:
    outFile.write("%s\t%s\n" % (key, '\t'.join(mutReadDict[key])))
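Put together, the end of the script could look like this sketch (assuming Python 2, as in your code, where .keys() already returns a list; in Python 3 you would use sorted(mutReadDict) instead):
# Sketch of the corrected output loop (Python 2 as in the question).
keyList = mutReadDict.keys()      # a plain list in Python 2
keyList.sort()
for key in keyList:
    # one line per mutation key, with all matching read IDs tab-separated
    outFile.write("%s\t%s\n" % (key, '\t'.join(mutReadDict[key])))
outFile.close()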


out of bounds error when using a list as an index

I have two files: one is a single column (call it pred) and has no headers, the other has two columns: ID and IsClick (it has headers). My goal is to use the column ID as an index to pred.
import pandas as pd
import numpy as np
def LinesInFile(path):
    with open(path) as f:
        for linecount, line in enumerate(f):
            pass
    f.close()
    print 'Found ' + str(linecount) + ' lines'
    return linecount
path ='/Users/mas/Documents/workspace/Avito/input/' # path to testing file
submission = path + 'submission1234.csv'
lines = LinesInFile(submission)
lines = LinesInFile(path + 'sampleSubmission.csv')
sample = pd.read_csv(path + 'sampleSubmission.csv')
preds = np.array(pd.read_csv(submission, header = None))
index = sample.ID.values - 1
print index
print len(index)
sample['IsClick'] = preds[index]
sample.to_csv('submission.csv', index=False)
The output is:
Found 7816360 lines
Found 7816361 lines
[ 0 4 5 ..., 15961507 15961508 15961511]
7816361
Traceback (most recent call last):
File "/Users/mas/Documents/workspace/Avito/July3b.py", line 23, in <module>
sample['IsClick'] = preds[index]
IndexError: index 7816362 is out of bounds for axis 0 with size 7816361
There seems to be something wrong, because my file has 7816361 lines counting the header, while my list has an extra element (len of list is 7816361).
I don't have your csv files to recreate the problem, but the problem looks like it is being caused by your use of index.
index = sample.ID.values - 1 takes each of your sample IDs and subtracts 1. These are not valid index values into pred, which is only 7816360 rows long. Each of the last 3 items in your index array (based on your print output) would go out of bounds, as they are greater than 7816360. I suspect the error is showing you the first of your ID - 1 values that goes out of bounds.
Assuming you just want to join the files based on their line number, you could do the following:
sample = pd.concat(
    (pd.read_csv(path + 'sampleSubmission.csv'),
     pd.read_csv(submission, header=None).rename(columns={0: 'IsClick'})),
    axis=1)
Otherwise you'll need to perform a join or merge on your two dataframes.
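Spelled out a bit more, the same line-number join could look like this sketch (it assumes the path and submission variables from your script, and that each line of the prediction file corresponds to the same row of sampleSubmission.csv):
import pandas as pd

# Sketch: pair the sample IDs with the predictions by line number.
sample = pd.read_csv(path + 'sampleSubmission.csv')               # columns: ID, IsClick
preds = pd.read_csv(submission, header=None, names=['IsClick'])   # one prediction per line

out = pd.concat([sample[['ID']], preds], axis=1)  # aligns on the default RangeIndex
out.to_csv('submission.csv', index=False)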

I'm trying to fix an error in my code when adding a number to another number in a file

I have a file like this:
RubyWilson,20,174.0,female,23.45,1562.41,**367**
I'm trying to add a number to the last number in the file, like this:
number = 300
RubyWilson,20,174.0,female,23.45,1562.41,**667**
This is what I've tried so far:
name = input("name")
FitnessFile = open((name + "fitness file.csv") , "r")
myVar = FitnessFile.read()
FitnessFile.close()
myList = myVar.split(",")
number = int(input("enter number"))
str(myList[6])) = int(myList[6]) + (number)
FitnessFile = open((name + "fitness file.csv") , "w")
addList = ",".join(myList)
FitnessFile.write(addList)
FitnessFile.close()
When I run it, it says "can't assign to function call" on line 6.
How do I fix this?
I suspect you want to replace
str(myList[6])) = int(myList[6]) + (number)
which, by the way, isn't even syntactically valid, with
myList[6] = str(int(myList[6]) + (number))
However, none of this takes care of the asterisks. Do they actually appear in your data?
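Assuming the asterisks are only formatting and the file really is a single comma-separated record, the whole read-modify-write cycle might look like this sketch:
# Sketch: read one comma-separated record, add `number` to field 7, write it back.
name = input("name")
filename = name + "fitness file.csv"

with open(filename, "r") as fitness_file:
    fields = fitness_file.read().split(",")

number = int(input("enter number"))
fields[6] = str(int(fields[6]) + number)   # convert to int, add, convert back to str

with open(filename, "w") as fitness_file:
    fitness_file.write(",".join(fields))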

Python2.7: Too many Values to Unpack - Number of Columns unknown

I have a file that I want to unpack and whose columns I want to use in different files. The issue is that the number of columns varies from row to row (for example, row 1 could have 7 columns and row 2 could have 15).
How do I unpack the file without receiving the error "Too many values to unpack"?
filehandle3 = open('output_steps.txt', 'r')
filehandle4 = open('head_cluster.txt', 'w')
for line in iter(filehandle3):
    id, category = line.strip('\n').split('\t')
    filehandle4.write(id + "\t" + category + "\n")
filehandle3.close()
filehandle4.close()
Any help would be great. Thanks!
You should extract the values separately, if present, e.g. like this:
for line in iter(filehandle3):
    values = line.strip('\n').split('\t')
    id = values[0] if len(values) > 0 else None
    category = values[1] if len(values) > 1 else None
    ...
You could also create a helper function for this:
def safe_get(values, index, default=None):
    return values[index] if len(values) > index else default
or using try/except:
def safe_get(values, index, default=None):
    try:
        return values[index]
    except IndexError:
        return default
and use it like this:
category = safe_get(values, 1)
With Python 3, and if the rows always have at least as many elements as you need, you can use
for line in iter(filehandle3):
    id, category, *junk = line.strip('\n').split('\t')
This will bind the first element to id, the second to category, and the rest to junk.
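For completeness, here is a sketch of the original loop using the length check, which simply skips rows with fewer than two columns:
# Sketch: write id and category only for rows that have at least two columns.
filehandle3 = open('output_steps.txt', 'r')
filehandle4 = open('head_cluster.txt', 'w')
for line in filehandle3:
    values = line.strip('\n').split('\t')
    if len(values) < 2:
        continue                     # too few columns in this row; skip it
    filehandle4.write(values[0] + "\t" + values[1] + "\n")
filehandle3.close()
filehandle4.close()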

Append a new column by subtracting existing columns in a CSV, using Python

I tried to append a new column to an existing CSV file using Python. It does not show any error, but the column is not created.
I have a CSV file with several columns, and I want to fill a new column by subtracting one existing column from another.
ID,SURFACES,A1X,A1Y,A1Z,A2X
1,GROUND,800085.3323,961271.977,-3.07E-18,800080.8795
Add the column AX (= A1X - A2X).
CODE:
import csv

x = csv.reader(open('E:/solarpotential analysis/iitborientation/trialcsv.csv', 'rb'))
y = csv.writer(open('E:/solarpotential analysis/iitborientation/trial.csv', 'wb', buffering=0))
for row in x:
    a = float(row[0])
    b = str(row[1])
    c = float(row[2])
    d = float(row[3])
    e = float(row[4])
    f = float(row[2] - row[5])
    y.writerow([a, b, c, d, e, f])
It shows no error, but the output file is not updated.
You can do it this way:
inputt = open("input.csv", "r")
outputt = open("output.csv", "w")
for line in inputt.readlines():
    #print line.replace("\n","")
    outputt.write(line.replace("\n", "") + ";6column\n")
inputt.close()
outputt.close()
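Note that this only appends a literal string to each line. If the goal really is AX = A1X - A2X, a csv-based sketch (assuming Python 2, hence the 'rb'/'wb' modes, the header shown in the question, and shortened file paths) could look like this:
import csv

# Sketch: append AX = A1X - A2X to every data row.
with open('trialcsv.csv', 'rb') as src, open('trial.csv', 'wb') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    header = next(reader)              # keep the header row
    writer.writerow(header + ['AX'])
    for row in reader:
        ax = float(row[2]) - float(row[5])   # A1X - A2X
        writer.writerow(row + [ax])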

IndexError, but more likely I/O error

I am unsure why I am getting this error. I'm reading from a file called columns_unsorted.txt, then trying to write to columns_sorted.txt. The error is on fan_on = int(string_j[1]), saying list index out of range. Here's my code:
#!/usr/bin/python
import fileinput
import collections
# open document to read from
j = open('./columns_unsorted.txt', 'r')
# note this is a file of rows of space-delimited data in the format <1384055277275353 0 0 0 1 0 0 0 0 22:47:57> on each row, the first term being unix time, the last human time, the middle binary fields indicating which machine event happened
# open document to record results into
l = open('./columns_sorted.txt', 'w')
# CREATE ARRAY CALLED EVENTS
events = collections.deque()
i = 1
# FILL ARRAY WITH "FACTS" ROWS; SPLIT INTO FIELDS, CHANGE TYPES AS APPROPRIATE
for line in j:  # columns_unsorted
    line = line.rstrip('\n')
    string_j = line.split(' ')
    time = str(string_j[0])
    fan_on = int(string_j[1])
    fan_off = int(string_j[2])
    heater_on = int(string_j[3])
    heater_off = int(string_j[4])
    space_on = int(string_j[5])
    space_off = int(string_j[6])
    pump_on = int(string_j[7])
    pump_off = int(string_j[8])
    event_time = str(string_j[9])
    row = time, fan_on, fan_off, heater_on, heater_off, space_on, space_off, pump_on, pump_off, event_time
    events.append(row)
You are missing the readlines function, no?
You have to do:
j = open('./columns_unsorted.txt', 'r')
l = j.readlines()
for line in l:
    # what you want to do with each line
In the future, you should print some of your variables, just to be sure the code is working as you want it to and to help you identify problems.
(For example, if you printed string_j in your code, you would see what kind of problem you have.)
The problem was an inconsistent line in the data file. Forgive my haste in posting.
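Given that, a defensive version of the loop (a sketch assuming every valid row has exactly 10 space-delimited fields, as the comment in the question describes) would simply skip rows of the wrong length:
# Sketch: skip malformed rows instead of letting them raise IndexError.
for line in j:
    string_j = line.rstrip('\n').split(' ')
    if len(string_j) != 10:
        continue                              # inconsistent line; skip it
    time, event_time = string_j[0], string_j[9]
    flags = [int(v) for v in string_j[1:9]]   # the eight on/off indicators
    events.append(tuple([time] + flags + [event_time]))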