List index out of range when reading a file - list

I am opening a file and trying to read the 3rd value on each line. Here is my code:
myfile = 'dummy2.pepmasses'
fileObj = open(myfile, 'r')
line = fileObj.readline()
while line:
    line = fileObj.readline()
    linesplit = line.split()
    weight = linesplit[2]
    print(weight)
fileObj.close
This prints the third value of each line correctly, but it ends with an IndexError, and I'm not sure why: I'm not specifying a range of values to read, just reading everything. I believe the issue is that reading the file produces an empty list [] at the end, although the actual file has no blank lines, so I don't understand what is happening.
Any ideas appreciated, thanks.
The end of my file is
STE50,YCL032W 36 1262.6920 0 0 QQGLHPAIMLR
STE50,YCL032W 37 174.1117 0 0 R
STE50,YCL032W 38 174.1117 0 0 R
STE50,YCL032W 39 2081.8783 0 0 GDFEEVAMMNGSDNVTPGGR
STE50,YCL032W 40 131.0947 0 0 L*
The error generated is
174.1117
2081.8783
131.0947
Traceback (most recent call last):
File "C:/Users/user/PycharmProjects/Test/Test.py", line 12, in <module>
weight = linesplit[2]
IndexError: list index out of range

You should check whether linesplit has at least 3 values in it and only print the weight in that case.
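A minimal sketch of that check, assuming the goal is simply to collect the third whitespace-separated field of every line. The function name `third_fields` and the sample list are illustrative, not from the question. Note that `readline()` returns an empty string at end of file, and an empty string splits into `[]`, which is what triggers the IndexError in the loop above.

```python
def third_fields(lines):
    """Return the third whitespace-separated field of each line that has one."""
    weights = []
    for line in lines:
        parts = line.split()
        # Skip blank or short lines instead of indexing past the end.
        if len(parts) >= 3:
            weights.append(parts[2])
    return weights


# Sample lines modelled on the end of the file shown in the question.
sample = [
    "STE50,YCL032W 39 2081.8783 0 0 GDFEEVAMMNGSDNVTPGGR\n",
    "STE50,YCL032W 40 131.0947 0 0 L*\n",
    "",  # what readline() returns at end of file
]
print(third_fields(sample))
```

With a real file, the open file object can be passed directly, for example `with open(myfile) as f: third_fields(f)`, since iterating over a file yields its lines and stops cleanly at the end.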

Related

What is my problem with join mapper code?

I'm trying to practice joining data using MapReduce, but when I run this line:
cat join1_File*.txt | ./join1_mapper.py | sort | ./join1_reducer.py
it displays this error:
Traceback (most recent call last):
File "./join1_mapper.py", line 24, in <module>
value_in = key_value[1] #value is 2nd item
IndexError: list index out of range
Apr-04 able 13 n-01 5
Dec-15 able 100 n-01 5
Feb-02 about 3 11
Mar-03 about 8 11
Feb-22 actor 3 22
Feb-23 burger 5 15
Mar-08 burger 2 15
I expect the output to look like this:
Apr-04 able 13 n-01 5
Dec-15 able 100 n-01 5
Feb-02 about 3 11
Mar-03 about 8 11
Feb-22 actor 3 22
Feb-23 burger 5 15
Mar-08 burger 2 15
This is my join1_mapper.py code:
import sys
for line in sys.stdin:
    line = line.strip() #strip out carriage return
    key_value = line.split(",") #split line, into key and value, returns a list
    key_in = key_value[0].split(" ") #key is first item in list
    value_in = key_value[1] #value is 2nd item
    #print key_in
    if len(key_in)>=2: #if this entry has <date word> in key
        date = key_in[0] #now get date from key field
        word = key_in[1]
        value_out = date+" "+value_in #concatenate date, blank, and value_in
        print( '%s\t%s' % (word, value_out) ) #print a string, tab, and string
    else: #key is only <word> so just pass it through
        print( '%s\t%s' % (key_in[0], value_in) ) #print a string tab and string
#Note that Hadoop expects a tab to separate key value
#but this program assumes the input file has a ',' separating key value

Reading in TSP file Python

I need to figure out how to read in the data from the file 'berlin52.tsp'.
This is the format I'm using:
NAME: berlin52
TYPE: TSP
COMMENT: 52 locations in Berlin (Groetschel)
DIMENSION : 52
EDGE_WEIGHT_TYPE : EUC_2D
NODE_COORD_SECTION
1 565.0 575.0
2 25.0 185.0
3 345.0 750.0
4 945.0 685.0
5 845.0 655.0
6 880.0 660.0
7 25.0 230.0
8 525.0 1000.0
9 580.0 1175.0
10 650.0 1130.0
And this is my current code
# Open input file
infile = open('berlin52.tsp', 'r')
# Read instance header
Name = infile.readline().strip().split()[1] # NAME
FileType = infile.readline().strip().split()[1] # TYPE
Comment = infile.readline().strip().split()[1] # COMMENT
Dimension = infile.readline().strip().split()[1] # DIMENSION
EdgeWeightType = infile.readline().strip().split()[1] # EDGE_WEIGHT_TYPE
infile.readline()
# Read node list
nodelist = []
N = int(intDimension)
for i in range(0, int(intDimension)):
    x,y = infile.readline().strip().split()[1:]
    nodelist.append([int(x), int(y)])
# Close input file
infile.close()
The code should read in the file and output a list of tours with the values "1, 2, 3..." and so on, while the x and y values are stored so that distances can be calculated. It can collect the headers, at least. The problem arises when creating the list of nodes.
This is the error I get, though:
ValueError: invalid literal for int() with base 10: '565.0'
What am I doing wrong here?
This is a file in TSPLIB format. To load it in Python, take a look at the Python package tsplib95, available through PyPI or on GitHub.
Documentation is available at https://tsplib95.readthedocs.io/
You can convert the TSPLIB file to a networkx graph and retrieve the necessary information from there.
You are feeding the string "565.0" into nodelist.append([int(x), int(y)]).
It is telling you it doesn't like that because that string is not an integer. The .0 at the end makes it a float.
So if you change that to nodelist.append([float(x), float(y)]), as just one possible solution, then you'll see that your problem goes away.
Alternatively, you can try removing or separating the '.0' from your string input.
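A minimal sketch of the float-based fix, assuming each node line has the form `index x y` as in the excerpt from the question. The helper name `parse_nodes` is hypothetical.

```python
def parse_nodes(lines):
    """Parse 'index x y' lines into a list of [x, y] float pairs."""
    nodelist = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank lines or markers such as EOF
        _, x, y = parts
        # float() accepts '565.0'; int() would raise ValueError on it.
        nodelist.append([float(x), float(y)])
    return nodelist


coords = parse_nodes(["1 565.0 575.0", "2 25.0 185.0", "3 345.0 750.0"])
print(coords)
```

If integer coordinates are really needed, `int(float(x))` converts in two steps, at the cost of truncating any fractional part.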
There are two problems with the code above. I ran it and found the following.
The first is this line:
Dimension = infile.readline().strip().split()[1]
It should be:
Dimension = infile.readline().strip().split()[2]
The index is 2 instead of 1 because the header line is "DIMENSION : 52": split()[1] is ":" and split()[2] is "52".
Both are strings.
The second problem is this line:
N = int(intDimension)
It should be:
N = int(Dimension)
Finally, in the line
for i in range(0, int(intDimension)):
simply use
for i in range(0, N):
With these changes it should work, I think.
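The index differs between header lines because `split()` breaks on whitespace: `DIMENSION : 52` yields three tokens while `NAME: berlin52` yields two. One way to avoid depending on the token position is to split each header line on the first colon instead. This is only a sketch; the helper name `parse_header` is illustrative, and the header lines are those shown in the question.

```python
def parse_header(lines):
    """Map each 'KEY: value' or 'KEY : value' header line to a dict entry."""
    header = {}
    for line in lines:
        key, sep, value = line.partition(":")
        if not sep:
            break  # first line without a colon (NODE_COORD_SECTION) ends the header
        header[key.strip()] = value.strip()
    return header


header = parse_header([
    "NAME: berlin52",
    "TYPE: TSP",
    "COMMENT: 52 locations in Berlin (Groetschel)",
    "DIMENSION : 52",
    "EDGE_WEIGHT_TYPE : EUC_2D",
    "NODE_COORD_SECTION",
])
N = int(header["DIMENSION"])
print(N, header["EDGE_WEIGHT_TYPE"])
```

This also keeps the full comment text, which `split()[1]` would cut down to its first word.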
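The failing call is int(x) in
nodelist.append([int(x), int(y)])
int() can't convert the string x ("565.0") to an integer because of the ".".
Add
x=x[:len(x)-2]
y=y[:len(y)-2]
before the append to remove the trailing ".0". Note that this only works when every coordinate ends in exactly ".0".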

Assigning 1 character from 1 list to another python

Hi, I am making a decryption machine for my school project, but I can't get it to work. Can you help me out?
Thanks in advance.
The error is: line 17, IndexError: list index out of range
The length of zin is 86, just so you know.
This is what is in the file I need to decrypt: KEIGO N JIDOUBANEUOFIDNEIESUN IRAEI ESTIGIVNKMUEEER RDONAEOIW ENEZAEE NAML VN NILLRA
with open('something.txt', 'r') as fhandle:
    key = 3
    #reading the file
    zin = list(fhandle.readline())
    #setting up solution to which we will output
    solution = list(" ")*86
    solution[0] = zin[0]
    #while loop in which we use the key to decrypt the message
    i = 1
    while i < len(zin):
        solution[i] = zin[key] #this is where i get the error
        i += 1
        key += key
        if i > 86:
            break
    print(solution)
Since you are accessing zin[key], you need to verify length of zin is at least key+1.
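A minimal sketch of that bounds check, keeping the question's loop otherwise unchanged. Because `key += key` doubles the key on every pass (3, 6, 12, 24, 48, 96), it runs past the last valid index of an 86-character line on the sixth pass. The function name `copy_with_key` is illustrative, and the sketch only demonstrates the guard; it is not claimed to be the correct decryption.

```python
def copy_with_key(zin, key=3):
    """Copy zin[key] into successive slots, stopping before key runs off the end."""
    solution = [" "] * len(zin)
    if not zin:
        return solution
    solution[0] = zin[0]
    i = 1
    # zin[key] is only valid while key < len(zin), i.e. len(zin) >= key + 1.
    while i < len(zin) and key < len(zin):
        solution[i] = zin[key]
        i += 1
        key += key  # doubles the key, as in the question
    return solution


text = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
print("".join(copy_with_key(text)).rstrip())
```

Sizing `solution` from `len(zin)` rather than a hard-coded 86 also removes the need for the separate `if i > 86` check.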

Need help in improving the speed of my code for duplicate columns removal in Python

I have written code that takes a text file as input and prints only the variants which occur more than once. By variants I mean chr positions in the text file.
The input file looks like this:
chr1 1048989 1048989 A G intronic C1orf159 0.16 rs4970406
chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407
chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407
chr1 1113121 1113121 G A intronic TTLL10 0.13 rs12092254
As you can see, rows 2 and 3 repeat. I'm just taking the first 3 columns and checking whether they are the same. Here, chr1 1049083 1049083 repeats in both row 2 and row 3, so I print out that there is one duplicate and its position.
I have written the code below. Though it does what I want, it's quite slow: it takes about 5 minutes to run on a file which has 700,000 rows. I wanted to know if there is a way to speed things up.
Thanks!
#!/usr/bin/env python
""" takes in a input file and
prints out only the variants that occur more than once """
import shlex
import collections
rows = open('variants.txt', 'r').read().split("\n")
# removing the header and storing it in a new variable
header = rows.pop()
indices = []
for row in rows:
    var = shlex.split(row)
    indices.append("_".join(var[0:3]))
dup_list = []
ind_tuple = collections.Counter(indices).items()
for x, y in ind_tuple:
    if y>1:
        dup_list.append(x)
print dup_list
print len(dup_list)
Note: In this case the entire row2 is a duplicate of row3. But this is not necessarily the case all the time. Duplicate of chr positions (first three columns) is what I'm looking for.
EDIT:
Edited the code as per the suggestion of damienfrancois. Below is my new code:
f = open('variants.txt', 'r')
indices = {}
for line in f:
    row = line.rstrip()
    var = shlex.split(row)
    index = "_".join(var[0:3])
    if indices.has_key(index):
        indices[index] = indices[index] + 1
    else:
        indices[index] = 1
dup_pos = 0
for key, value in indices.items():
    if value > 1:
        dup_pos = dup_pos + 1
print dup_pos
I used time to see how long each version takes.
My original code:
time run remove_dup.py
14428
CPU times: user 181.75 s, sys: 2.46 s,total: 184.20 s
Wall time: 209.31 s
Code after modification:
time run remove_dup2.py
14428
CPU times: user 177.99 s, sys: 2.17 s, total: 180.16 s
Wall time: 222.76 s
I don't see any significant improvement in the time.
Some suggestions:
Do not read the whole file at once; read it line by line and process it on the fly. You'll save memory operations.
Let indices be a defaultdict and increment the value at key "_".join(var[0:3]); this saves the costly (guessing here, should use a profiler) collections.Counter(indices).items() step.
Try PyPy or a Python compiler.
Split your data into as many subsets as your computer has cores, apply the program to each subset in parallel, then merge the results.
HTH
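The first two suggestions can be sketched as follows, in Python 3 syntax and assuming the columns are whitespace-separated so that `str.split()` can stand in for `shlex.split()`. The function name `count_positions` is illustrative; the sample rows are the ones from the question.

```python
from collections import defaultdict


def count_positions(lines):
    """Count how often each (chr, start, end) triple occurs, one line at a time."""
    counts = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        counts["_".join(fields[:3])] += 1
    return counts


rows = [
    "chr1 1048989 1048989 A G intronic C1orf159 0.16 rs4970406",
    "chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407",
    "chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407",
    "chr1 1113121 1113121 G A intronic TTLL10 0.13 rs12092254",
]
counts = count_positions(rows)
dup_list = [key for key, n in counts.items() if n > 1]
print(dup_list)
```

An open file object can be passed in place of `rows`, so the file is never held in memory as a whole.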
A big time sink is probably the if..has_key() portion of the code. In my experience, try-except is a lot faster...
f = open('variants.txt', 'r')
indices = {}
for line in f:
    var = line.split()
    index = "_".join(var[0:3])
    try:
        indices[index] += 1
    except KeyError:
        indices[index] = 1
f.close()
dup_pos = 0
for key, value in indices.items():
    if value > 1:
        dup_pos = dup_pos + 1
print dup_pos
Another option would be to replace the four try-except lines with:
indices[index] = 1 + indices.get(index,0)
This approach only tells you how many distinct lines are duplicated, not how many times each one is repeated. (So if one line occurs three times, it still counts as one.)
If you are only trying to count the duplicates and not delete or note them, you could tally the lines of the file as you go and compare this to the length of the indices dictionary; the difference is the number of duplicate lines (instead of looping back through and re-counting). This might save a little time, but gives a different answer:
#!/usr/bin/env python
f = open('variants.txt', 'r')
indices = {}
total_len=0
for line in f:
    total_len +=1
    var = line.split()
    index = "_".join(var[0:3])
    indices[index] = 1 + indices.get(index,0)
f.close()
print "Number of duplicated lines:", total_len - len(indices.keys())
I'd be curious to hear what your benchmarks are for code that does not include the has_key() test...
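To make the difference between the two counts concrete, here is a small sketch on made-up keys (Python 3; all names are illustrative). A key that occurs three times counts once as a duplicated position, but contributes two extra lines.

```python
from collections import Counter

keys = ["a", "a", "a", "b", "b", "c"]  # hypothetical position keys
counts = Counter(keys)

# Number of distinct keys that occur more than once.
duplicated_keys = sum(1 for n in counts.values() if n > 1)

# Number of lines beyond the first occurrence of each key.
extra_lines = len(keys) - len(counts)

print(duplicated_keys, extra_lines)
```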

JasperReports: Converting String into array and populating List with it

What I have is this String: [125, 154, 749, 215, 785, 1556, 3214, 7985]
(The string can contain anywhere from 1 to 15 IDs, and it is a String rather than a List because it is being sent through a URL.)
I need to populate a List called campusAndFaculty with it.
I am using iReport 5.0.0.
I've tried entering this in the campusAndFaculty default value expression field:
Array.asList(($P{campusAndFacultyString}.substring( 1, ($P{campusAndFacultyString}.length() -2 ))).split("\\s*,\\s*"))
But it does not populate the campusAndFaculty List.
Any idea how I can populate the List campusAndFaculty using that String ("campusAndFacultyString")?
======================
UPDATE
I have these variables in iReport (5.0.0):
String campusAndFacultyFromBack = "[111, 125, 126, 4587, 1235, 1259]"
String noBrackets = $P{campusAndFacultyFromBack}.substring(1, ($P{campusAndFacultyFromBack}.length() -1 ))
List campusAndFacultyVar = java.util.Arrays.asList(($V{noBrackets}).split("\\s*,\\s*"))
When I print campusAndFacultyVar it returns "[111, 125, 126, 4587, 1235, 1259]",
but when I use it in a filter I get the error "Cannot evaluate the following expression: org_organisation.org_be_id in null".
This works for me:
String something = "[125, 154, 749, 215, 785, 1556, 3214, 7985]";
Arrays.asList((something.substring(1, (something.length() -1 ))).split("\\s*,\\s*"));
Which means you can do this in iReport:
java.util.Arrays.asList(($P{campusAndFacultyString}.substring(1, ($P{campusAndFacultyString}.length() -1 ))).split("\\s*,\\s*"));
Differences from your snippet:
It's Arrays, not Array.
You should subtract 1, not 2, from the length.
It uses a fully qualified reference to the Arrays class (which may or may not matter depending on how your iReport is configured).