Assigning 1 character from 1 list to another Python list

Hi, I am making a decryption machine for my school project but I can't get it to work. Can you help me out?
Thanks in advance.
The error is: line 17, IndexError: list index out of range
The length of zin is 86, just so you know.
This is what is in the file I need to decrypt: KEIGO N JIDOUBANEUOFIDNEIESUN IRAEI ESTIGIVNKMUEEER RDONAEOIW ENEZAEE NAML VN NILLRA
with open('something.txt', 'r') as fhandle:
    key = 3
    # reading the file
    zin = list(fhandle.readline())
    # setting up solution to which we will output
    solution = list(" ") * 86
    solution[0] = zin[0]
    # while loop in which we use the key to decrypt the message
    i = 1
    while i < len(zin):
        solution[i] = zin[key]  # this is where i get the error
        i += 1
        key += key
        if i > 86:
            break
    print(solution)

Since you are accessing zin[key], you need to verify that the length of zin is at least key + 1 before each access. In your loop, key doubles every iteration (key += key), so after a few passes it exceeds 85 and the index goes out of range.
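As a sketch (assuming the intended behavior is to wrap the index around the message length rather than run past the end), you could keep key inside the valid range with a modulo:

```python
message = "KEIGO N JIDOUBANEUOFIDNEIESUN IRAEI"  # shortened sample input
zin = list(message)

key = 3
solution = [zin[0]]
i = 1
while i < len(zin):
    # wrap the doubling key back into range so zin[key] is always valid
    solution.append(zin[key % len(zin)])
    key += key
    i += 1

print(''.join(solution))
```

Whether wrapping is the right decryption rule depends on how the message was encrypted; the point is only that every index into zin must stay below len(zin).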

Related

Reading in TSP file Python

I need to figure out how to read in this data from the file 'berlin52.tsp'.
This is the format I'm using
NAME: berlin52
TYPE: TSP
COMMENT: 52 locations in Berlin (Groetschel)
DIMENSION : 52
EDGE_WEIGHT_TYPE : EUC_2D
NODE_COORD_SECTION
1 565.0 575.0
2 25.0 185.0
3 345.0 750.0
4 945.0 685.0
5 845.0 655.0
6 880.0 660.0
7 25.0 230.0
8 525.0 1000.0
9 580.0 1175.0
10 650.0 1130.0
And this is my current code
# Open input file
infile = open('berlin52.tsp', 'r')

# Read instance header
Name = infile.readline().strip().split()[1]            # NAME
FileType = infile.readline().strip().split()[1]        # TYPE
Comment = infile.readline().strip().split()[1]         # COMMENT
Dimension = infile.readline().strip().split()[1]       # DIMENSION
EdgeWeightType = infile.readline().strip().split()[1]  # EDGE_WEIGHT_TYPE
infile.readline()

# Read node list
nodelist = []
N = int(intDimension)
for i in range(0, int(intDimension)):
    x, y = infile.readline().strip().split()[1:]
    nodelist.append([int(x), int(y)])

# Close input file
infile.close()
The code should read in the file and output a list of tours with the values "1, 2, 3..." and so on, while the x and y values are stored to be used for distance calculations. It can collect the headers, at least; the problem arises when creating the list of nodes.
This is the error I get though
ValueError: invalid literal for int() with base 10: '565.0'
What am I doing wrong here?
This is a file in TSPLIB format. To load it in Python, take a look at the Python package tsplib95, available through PyPI or on GitHub.
Documentation is available at https://tsplib95.readthedocs.io/
You can convert the TSPLIB file to a networkx graph and retrieve the necessary information from there.
You are feeding the string "565.0" into nodelist.append([int(x), int(y)]).
It is telling you it doesn't like that because that string is not an integer: the .0 at the end makes it a float.
So if you change that to nodelist.append([float(x), float(y)]), as just one possible solution, you'll see that your problem goes away.
Alternatively, you can try removing or separating the '.0' from your string input.
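A minimal sketch of the float-based fix, with the node-list loop from the question run against a few inlined sample lines instead of the open file:

```python
# Sample coordinate lines in the question's NODE_COORD_SECTION format
lines = [
    "1 565.0 575.0",
    "2 25.0 185.0",
    "3 345.0 750.0",
]

nodelist = []
for line in lines:
    # drop the node index, keep the two coordinates as floats
    x, y = line.strip().split()[1:]
    nodelist.append([float(x), float(y)])

print(nodelist)
```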
There are two problems with the code above. I ran the code and found them in the lines below:
Dimension = infile.readline().strip().split()[1]
This line should be
`Dimension = infile.readline().strip().split()[2]`
Instead of 1 it must be 2, because the header line "DIMENSION : 52" splits into ['DIMENSION', ':', '52']: index 1 gives ':' while index 2 gives '52'.
Both are of string type.
The second problem is with the line
N = int(intDimension)
It should be
N = int(Dimension)
And lastly, in the line
for i in range(0, int(intDimension)):
just use
for i in range(0, N):
Now everything should be alright, I think.
nodelist.append([int(x), int(y)])
fails at int(x): the function int() can't convert the string "565.0" to an integer because of the ".". Add
x = x[:len(x)-2]
y = y[:len(y)-2]
to remove the trailing ".0".
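As an alternative sketch, converting through float avoids hand-trimming the string, and still works if a coordinate ever has more than one digit after the decimal point:

```python
x = "565.0"

# parse as float first, then truncate to int
xi = int(float(x))
print(xi)
```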

Python, build dict from a list with certain items as keys and items in between as values

I have a text file broken into a list of strings in the format:
['DATE', 'NAME', 'RT', '1A', '541', '09947', '199407', '552', '09949', 'BOON', '101C', 'SMITH', '00321', '1553678', '1851243', '561', '559', '004789', '1749201', 'ANDERSON']
I would like to create a dict using the items where item[0:-1].isdigit() and item[-1].isalpha(), so in the example above these would be 1A and 101C. I then want to add only the items where item.isdigit() and int(item) > 100000; the items that fit this criterion are assembled into a new list via a for loop (or maybe a while loop) until the loop hits the next key value.
The result would be dct = {'1A': ['199407'], '101C':['1553678','1851243','1749201']}
I'm currently getting an index error despite putting in a while condition to break once the iterations reach the number of items in the keys list. Before getting this error, I was indexing the values differently and getting an empty dict. I expect to get another empty dict once the index error is fixed.
Here's my code:
# create a list of the dictionary keys to find values in 1A format
# in order to avoid key error when building dict, do not add duplicate
# values to list. Needs to be a list and not a tuple so it can be indexed
keys = []
for line in lines:
    if line[0:-1].isdigit() and line[-1].isalpha() and line not in keys:
        keys.append(line)
print str(keys) + " " + str(len(keys))

# build a list of values for each item in keys. Should find the first
# key and check if a converted string to number is > 100000. If it is
# the value is appended to the valLst. If the next key is encountered
# the nested loop breaks and valLst is added to the current key. The
# primary loop moves to the next key while the nested loop should only
# consider items between the current primary iterable and the next.
passes = 0
while passes <= len(keys):  # exit loop before index error
    for key in keys:
        passes += 1
        curKey = keys.index(key)  # current primary iterable position
        nextKey = curKey + 1      # next primary iterable position
        print "Passes: " + str(passes)
        valLst = []  # empty list for dct values--resets after nested loop break
        for line in lines:  # iterate through text
            if line == keys[nextKey]:  # the next key value is encountered in text
                break
            dict[key] = valLst  # valLst added to current dict key
            curLine = lines.index(line)  # start at current key value found in text
            if curLine == key:  # find current key in text
                nextLine = curLine + 1  # get index of next value after current key in text
                val = lines[nextLine]  # next text value
                if val.isdigit():  # append value to valLst if it is > 100000
                    num = int(val)
                    if num > 100000:
                        valLst.append(num)
Here is my current error:
Traceback (most recent call last):
File "C:\Python27\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 323, in RunScript
debugger.run(codeObject, __main__.__dict__, start_stepping=0)
File "C:\Python27\Lib\site-packages\pythonwin\pywin\debugger\__init__.py", line 60, in run
_GetCurrentDebugger().run(cmd, globals,locals, start_stepping)
File "C:\Python27\Lib\site-packages\pythonwin\pywin\debugger\debugger.py", line 654, in run
exec cmd in globals, locals
File "C:\Users\user\Desktop\Scripts\PDF_Extractor.py", line 1, in <module>
from cStringIO import StringIO
IndexError: list index out of range
I've been looking into list comprehensions but haven't grasped them well enough to apply one in this situation. Am I going in the right direction with the above code or is there a list comprehension approach I could take that would be something like:
valLst = {key for keys in lines for line in line if line == key and int(line.isdigit()) > 100000 valLst.append(line)}
keys = ['DATE', 'NAME', 'RT', '1A', '541', '09947', '199407', '552', '09949', 'BOON',
        '101C', 'SMITH', '00321', '1553678', '1851243', '561', '559', '004789', '1749201', 'ANDERSON']

from collections import OrderedDict

valList = OrderedDict()
for k in keys:
    if len(k) > 0:
        if k[0].isdigit() and k[-1].isalpha() and ' ' not in k and k not in valList.keys():
            valList[k] = []
        try:
            if int(k) > 100000:
                try:
                    valList[valList.keys()[-1]].append(k)
                except ValueError:
                    valList[valList.keys()[-1]] = k
        except ValueError:
            continue
print valList
output:
OrderedDict([('1Y', ['15538870', '15922112', '16037395', '16069918', '16116102', '16292996', '16658378', '16700710', '16783588', '16832641', '16944735', '16994444', '313132', '12722185', '11415965', '10966593', '9983979', '8573715', '11733178', '552204', '3150537', '552422', '8013132', '9298415', '8742458', '8626402', '4708497', '11687768', '12192686', '734061', '734171', '9896029', '8636757', '2662814', '10407886', '11730755', '4504371', '9187313', '2362896', '7891338', '3519990', '12293652', '9226220', '5984854', '3295145', '1068579', '2031247', '11242586', '8408050', '8440673', '2752194', '5843333', '1740045', '2584772']), ('2A', ['16174735', '16330036', '16334662', '16345573', '16350100', '16376985', '16397823', '16411821', '16435182', '16443451', '16449626', '16574945', '16590154', '16597759', '16615837', '16649016', '16756921', '16762759', '16795828', '16879043', '16887968', '16900090', '16900428', '16902522', '16910127']), ('3A', ['16320336', '16328934', '16331684', '16346347', '16360892', '16370045', '16407413', '16408287', '16444990', '16446211', '16453706', '16467695', '16468032', '11697249', '11843287', '1339389', '2435865', '10001948', '4760965', '2480063', '13588296', '1813233', '11741885', '8972714', '9688478', '16070245']), ('3Y', ['13226120', '13232404', '13233834', '13235601', '13238679', '13241985', '13247504', '13249817', '13262823', '13268442', '13269981', '13270318', '13272413', '13282003', '13284535', '13288943', '13294453'])])
or inspect each dictionary one at a time to confirm we get the expected dictionary keys and items:
for d in valList.items():
    print d
('1Y', ['15538870', '15922112', '16037395', '16069918', '16116102', '16292996', '16658378', '16700710', '16783588', '16832641', '16944735', '16994444', '313132', '12722185', '11415965', '10966593', '9983979', '8573715', '11733178', '552204', '3150537', '552422', '8013132', '9298415', '8742458', '8626402', '4708497', '11687768', '12192686', '734061', '734171', '9896029', '8636757', '2662814', '10407886', '11730755', '4504371', '9187313', '2362896', '7891338', '3519990', '12293652', '9226220', '5984854', '3295145', '1068579', '2031247', '11242586', '8408050', '8440673', '2752194', '5843333', '1740045', '2584772'])
('2A', ['16174735', '16330036', '16334662', '16345573', '16350100', '16376985', '16397823', '16411821', '16435182', '16443451', '16449626', '16574945', '16590154', '16597759', '16615837', '16649016', '16756921', '16762759', '16795828', '16879043', '16887968', '16900090', '16900428', '16902522', '16910127'])
('3A', ['16320336', '16328934', '16331684', '16346347', '16360892', '16370045', '16407413', '16408287', '16444990', '16446211', '16453706', '16467695', '16468032', '11697249', '11843287', '1339389', '2435865', '10001948', '4760965', '2480063', '13588296', '1813233', '11741885', '8972714', '9688478', '16070245'])
('3Y', ['13226120', '13232404', '13233834', '13235601', '13238679', '13241985', '13247504', '13249817', '13262823', '13268442', '13269981', '13270318', '13272413', '13282003', '13284535', '13288943', '13294453'])
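For reference, a minimal Python 3 sketch of the same grouping idea, run against the question's sample list (the variable names here are illustrative, not taken from either answer):

```python
lines = ['DATE', 'NAME', 'RT', '1A', '541', '09947', '199407', '552', '09949', 'BOON',
         '101C', 'SMITH', '00321', '1553678', '1851243', '561', '559', '004789',
         '1749201', 'ANDERSON']

dct = {}
current = None
for item in lines:
    # a key looks like digits followed by a single letter, e.g. '1A' or '101C'
    if item[:-1].isdigit() and item[-1].isalpha():
        current = item
        dct.setdefault(current, [])
    elif current is not None and item.isdigit() and int(item) > 100000:
        # collect qualifying numbers under the most recently seen key
        dct[current].append(item)

print(dct)
```

This produces the dict the question asks for: {'1A': ['199407'], '101C': ['1553678', '1851243', '1749201']}.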

Extracting Specific Columns from Multiple Files & Writing to File Python

I have seven tab-delimited files; each file has exactly the same number and names of columns, but different data in each. Below is a sample of how each of the seven files looks:
test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 log2(fold_change)
000001 000001 ZZ 1:1 01 01 NOTEST 0 0 0 0 1 1 no
I am basically trying to read all seven files, extract the third, fourth and tenth columns (gene, locus, log2(fold_change)), and write those columns to a new file, so the file looks something like this:
gene name locus log2(fold_change) log2(fold_change) log2(fold_change) log2(fold_change) log2(fold_change) log2(fold_change) log2(fold_change)
ZZ 1:1 0 0 0 0
All the log2(fold_change) values are obtained from the tenth column of each of the seven files.
What I have so far is below; I need help constructing a more efficient, Pythonic way to accomplish the task above. Note that the code does not yet accomplish that task and needs some work.
from collections import defaultdict
import csv
import glob

dicti = defaultdict(list)
filetag = []

def read_data(file, base):
    with open(file, 'r') as f:
        reader = csv.reader(f, delimiter='\t')
        for row in reader:
            if 'test_id' not in row[0]:
                dicti[row[2]].append((base, row))

name_of_fold = raw_input("Folder name to stored output files in: ")
for file in glob.glob("*.txt"):
    base = file[0:3] + "-log2(fold_change)"
    filetag.append(base)
    read_data(file, base)

with open("output.txt", "w") as out:
    out.write("gene name" + "\t" + "locus" + "\t" + "\t".join(sorted(filetag)) + "\n")
    for k, v in dicti:
        out.write(k + "\t" + v[1][1][3] + "\t" + "".join([int(z[0][0:3]) * "\t" + z[1][9] for z in v]) + "\n")
So, the code above runs, but it is not what I am looking for, and here is why. The output is the issue: I am writing a tab-delimited output file with the gene in the first column (k) and the locus of that particular gene in v[1][1][3]. The part I am having a tough time coding is this piece of the output:
"".join([ int(z[0][0:3]) * "\t" + z[1][9] for z in v ])
I am trying to collect the fold change from each of the seven files for that particular gene and locus and write each one to the correct column. I multiply the column number derived from the file name by "\t" to push the value into the right column. The problem is that when the value from the next file comes along, writing resumes from where the previous write left off, which I don't want; I want each value's tabs counted from the beginning of the line.
Here is what I mean. For instance:
gene name locus log2(fold change) from file 1 .... log2(fold change) from file7
ZZ 1:3 0
0
Because the first log2 value is recorded based on its column number, say 2, I multiply 2 by "\t" plus the fold_change value and it records fine; but the last value, say from file seven, will not land in column seven, because its tabs are appended after wherever the previous write ended.
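One way to sketch a fix for the alignment problem (an illustrative rework, not the question's actual code): give each file a fixed column slot, fill the slots per (gene, locus), and join the whole row once, so tabs never accumulate across files:

```python
# Illustrative data: (gene, locus) -> {file_index: fold_change}
# Missing slots (files without a value) become empty columns.
num_files = 7
rows = {
    ('ZZ', '1:1'): {0: '0', 3: '0.5', 6: '1.2'},
}

out_lines = []
for (gene, locus), byfile in rows.items():
    cols = [byfile.get(i, '') for i in range(num_files)]  # blanks keep alignment
    out_lines.append('\t'.join([gene, locus] + cols))

print(out_lines[0])
```

Because every row is built as a complete list of columns before joining, each value lands in its own column regardless of which files contributed values.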
Here is my first approach:
import glob
import numpy as np

with open('output.txt', 'w') as out:
    fns = glob.glob('*.txt')  # Here you can change the pattern of the file (e.g. 'file_experiment_*.txt')
    # Title row:
    titles = ['gene_name', 'locus'] + [str(file + 1) + '_log2(fold_change)' for file in range(len(fns))]
    out.write('\t'.join(titles) + '\n')
    # Data row:
    data = []
    for idx, fn in enumerate(fns):
        file = np.genfromtxt(fn, skip_header=1, usecols=(2, 3, 9), dtype=np.str, autostrip=True)
        if idx == 0:
            data.extend([file[0], file[1]])
        data.append(file[2])
    out.write('\t'.join(data))
Content of the created file output.txt (Note: I created just three files for testing):
gene_name locus 1_log2(fold_change) 2_log2(fold_change) 3_log2(fold_change)
ZZ 1:1 0 0 0
I am using re instead of csv. The main problem with your code is the for loop that writes the output to the file. I am writing out the complete code; I hope this solves the problem you have.
import collections
import glob
import re

dicti = collections.defaultdict(list)
filetag = []

def read_data(file, base):
    with open(file, 'r') as f:
        for row in f:
            r = re.compile(r'([^\s]*)\s*')
            row = r.findall(row.strip())[:-1]
            print row
            if 'test_id' not in row[0]:
                dicti[row[2]].append((base, row))

def main():
    name_of_fold = raw_input("Folder name to stored output files in: ")
    for file in glob.glob("*.txt"):
        base = file[0:3] + "-log2(fold_change)"
        filetag.append(base)
        read_data(file, base)
    with open("output", "w") as out:
        data = ("genename" + "\t" + "locus" + "\t" + "\t".join(sorted(filetag)) + "\n")
        r = re.compile(r'([^\s]*)\s*')
        data = r.findall(data.strip())[:-1]
        out.write('{0[1]:<30}{0[2]:<30}{0[3]:<30}{0[4]:<30}{0[5]:<30} {0[6]:<30}{0[7]:<30}{0[8]:<30}'.format(data))
        out.write('\n')
        for key in dicti:
            print 'locus = ' + str(dicti[key][1])
            data = (key + "\t" + dicti[key][1][1][3] + "\t" + "".join([len(z[0][0:3]) * "\t" + z[1][9] for z in dicti[key]]) + "\n")
            data = r.findall(data.strip())[:-1]
            out.write('{0[0]:<30}{0[1]:<30}{0[2]:<30}{0[3]:<30}{0[4]:<30}{0[5]:<30}{0[6]:<30}{0[7]:<30}{0[8]:<30}'.format(data))
            out.write('\n')

if __name__ == '__main__':
    main()
I also changed the name of the output file from output.txt to output, as the former may interfere with the code, since the code considers all .txt files. I am attaching the output I got, which I assume is the format that you wanted.
Thanks
gene name locus 1.t-log2(fold_change) 2.t-log2(fold_change) 3.t-log2(fold_change) 4.t-log2(fold_change) 5.t-log2(fold_change) 6.t-log2(fold_change) 7.t-log2(fold_change)
ZZ 1:1 0 0 0 0 0 0 0
Remember to append \n to the end of each line to create a line break. This method is very memory efficient, as it processes just one row at a time.
import csv
import os
import glob

# Your folder location where the input files are saved.
name_of_folder = '...'
output_filename = 'output.txt'
input_files = glob.glob(os.path.join(name_of_folder, '*.txt'))

with open(os.path.join(name_of_folder, output_filename), 'w') as file_out:
    headers_read = False
    for input_file in input_files:
        if input_file == os.path.join(name_of_folder, output_filename):
            # If the output file is in the list of input files, ignore it.
            continue
        with open(input_file, 'r') as fin:
            reader = csv.reader(fin)
            if not headers_read:
                # Read column headers just once
                headers = reader.next()[0].split()
                headers = headers[2:4] + [headers[9]]
                file_out.write("\t".join(headers + ['\n']))  # Zero based indexing.
                headers_read = True
            else:
                _ = reader.next()  # Ignore header row.
            for line in reader:
                if line:  # Ignore blank lines.
                    line_out = line[0].split()
                    file_out.write("\t".join(line_out[2:4] + [line_out[9]] + ['\n']))
>>> !cat output.txt
gene locus log2(fold_change)
ZZ 1:1 0
ZZ 1:1 0

Variable within a number

This code asks the user for a message and a value, and then modifies the message with the given value. The problem is that I want the ASCII codes not to go over 126 or under 33. I tried to do so in the highlighted part of the code, but when the ASCII code gets over 126 the code returns me nothing for some reason.
loop = True

def start():
    final_word = []
    word_to_crypt = str(raw_input("Type a word: "))
    crypt_value = int(raw_input("Choose a number to cript yout message with: "))
    ascii_code = 0
    n = 0
    m = len(word_to_crypt)
    m = int(m - 1)
    while n <= m:
        ascii_code = ord(word_to_crypt[n])
        ascii_code += crypt_value
        ############# highlight #############
        if 33 > ascii_code > 126:
            ascii_code = (ascii_code % 94) + 33
        ############# highlight #############
        final_word.append(chr(ascii_code))
        n += 1
    print 'Your crypted word is: ' + ''.join(final_word)

while loop:
    start()
Sorry if it's not formatted well or for any mistakes in my explanation, but I'm on my phone and I'm not a native speaker.
Solved, thank you very much! This site and this community are helping me a lot!
There is no number that is greater than 126 and less than 33 at the same time, so 33 > ascii_code > 126 is never true. For a range test it should be:
if 33 < ascii_code < 126:
Edit:
Since you want the reversed case (wrap when the code falls outside the printable range), you have to write it separately:
if ascii_code < 33 or ascii_code > 126:
Or you can check membership in the valid range directly:
if ascii_code not in range(33, 127):
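A self-contained sketch of the fixed wrap, written as a function for easy testing (Python 3 here; ord and chr behave the same):

```python
def crypt(word, key):
    out = []
    for ch in word:
        code = ord(ch) + key
        # wrap back into the printable range 33..126 when we step outside it
        if code < 33 or code > 126:
            code = (code % 94) + 33
        out.append(chr(code))
    return ''.join(out)

print(crypt("hello", 60))
```

Every output character is guaranteed to stay between chr(33) and chr(126), since code % 94 is at most 93 and 93 + 33 = 126.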

Need help in improving the speed of my code for duplicate columns removal in Python

I have written code that takes a text file as input and prints only the variants which repeat more than once. By variants I mean chr positions in the text file.
The input file looks like this:
chr1 1048989 1048989 A G intronic C1orf159 0.16 rs4970406
chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407
chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407
chr1 1113121 1113121 G A intronic TTLL10 0.13 rs12092254
As you can see, rows 2 and 3 repeat. I'm just taking the first 3 columns and checking whether they are the same. Here, chr1 1049083 1049083 repeats in both row 2 and row 3, so I print out that there is one duplicate, along with its position.
I have written the code below. Though it does what I want, it's quite slow: it takes about 5 min to run on a file which has 700,000 rows. I wanted to know if there is a way to speed things up.
Thanks!
#!/usr/bin/env python
""" takes in a input file and
prints out only the variants that occur more than once """

import shlex
import collections

rows = open('variants.txt', 'r').read().split("\n")

# removing the header and storing it in a new variable
header = rows.pop()

indices = []
for row in rows:
    var = shlex.split(row)
    indices.append("_".join(var[0:3]))

dup_list = []
ind_tuple = collections.Counter(indices).items()
for x, y in ind_tuple:
    if y > 1:
        dup_list.append(x)

print dup_list
print len(dup_list)
Note: In this case the entire row2 is a duplicate of row3. But this is not necessarily the case all the time. Duplicate of chr positions (first three columns) is what I'm looking for.
EDIT:
Edited the code as per the suggestion of damienfrancois. Below is my new code:
import shlex

f = open('variants.txt', 'r')

indices = {}
for line in f:
    row = line.rstrip()
    var = shlex.split(row)
    index = "_".join(var[0:3])
    if indices.has_key(index):
        indices[index] = indices[index] + 1
    else:
        indices[index] = 1

dup_pos = 0
for key, value in indices.items():
    if value > 1:
        dup_pos = dup_pos + 1
print dup_pos
I used IPython's time magic to see how long both versions of the code take.
My original code:
time run remove_dup.py
14428
CPU times: user 181.75 s, sys: 2.46 s,total: 184.20 s
Wall time: 209.31 s
Code after modification:
time run remove_dup2.py
14428
CPU times: user 177.99 s, sys: 2.17 s, total: 180.16 s
Wall time: 222.76 s
I don't see any significant improvement in the time.
Some suggestions:
do not read the whole file at once; read line by line and process it on the fly, and you'll save memory operations
let indices be a defaultdict and increment the value at key "_".join(var[0:3]); this saves the costly (guessing here, you should use a profiler) collections.Counter(indices).items() step
try PyPy or a Python compiler
split your data into as many subsets as your computer has cores, apply the program to each subset in parallel, then merge the results
HTH
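A minimal sketch of the first two suggestions (streaming the input and counting with a defaultdict; the question's sample rows are inlined here in place of the file):

```python
from collections import defaultdict

lines = [
    "chr1 1048989 1048989 A G intronic C1orf159 0.16 rs4970406",
    "chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407",
    "chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407",
    "chr1 1113121 1113121 G A intronic TTLL10 0.13 rs12092254",
]

counts = defaultdict(int)
for line in lines:  # with a real file: for line in open('variants.txt')
    # str.split() is much cheaper than shlex.split() for plain whitespace fields
    key = "_".join(line.split()[0:3])
    counts[key] += 1

dups = [k for k, n in counts.items() if n > 1]
print(dups, len(dups))
```

Switching from shlex.split to str.split is likely the single biggest win here, since shlex does full shell-style tokenization on every row.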
A big time sink is probably the if..has_key() portion of the code. In my experience, try/except is a lot faster:
f = open('variants.txt', 'r')

indices = {}
for line in f:
    var = line.split()
    index = "_".join(var[0:3])
    try:
        indices[index] += 1
    except KeyError:
        indices[index] = 1
f.close()

dup_pos = 0
for key, value in indices.items():
    if value > 1:
        dup_pos = dup_pos + 1
print dup_pos
Another option would be to replace the four try/except lines with:
indices[index] = 1 + indices.get(index, 0)
This approach only tells you how many distinct lines are duplicated, not how many times each is repeated. (So if one line is duplicated 3x, it still counts as one...)
If you are only trying to count the duplicates, not delete or note them, you could tally the lines of the file as you go and compare that to the length of the indices dictionary; the difference is the number of duplicate lines (instead of looping back through and re-counting). This might save a little time, but it gives a different answer:
#!/usr/bin/env python
f = open('variants.txt', 'r')

indices = {}
total_len = 0
for line in f:
    total_len += 1
    var = line.split()
    index = "_".join(var[0:3])
    indices[index] = 1 + indices.get(index, 0)
f.close()

print "Number of duplicated lines:", total_len - len(indices.keys())
I'd be curious to hear your benchmarks for code that does not include the has_key() test...