New to python - trying to chose individual columns from transposed matrix - python-2.7

So presently code is as so:
table = []
for line in open("harrytest.csv") as f:
data = line.split(",")
table.append(data)
transposed = [[table[j][i] for j in range(len(table))] for i in range(len(table[0]))]
openings = transposed[1][1: - 1]
openings = [float(i) for i in openings]
mean = sum(openings)/len(openings)
print mean
minimum = min(openings)
print minimum
maximum = max(openings)
print maximum
range1 = maximum - minimum
print range1
This only prints one column of 7 for me, it also leaves out the bottom line. We are not allowed to import with csv module, use numpy, pandas. The only module allowed is os, sys, math & datetime.
How do I write the code so as to get median, first, last values for any column.

Change this line:
openings = transposed[1][1: - 1]
to this
openings = transposed[1][1:]
and the last row should appear. You calculations for mean, min, max and range seem correct.
For median you have to sort the row and select the one middle element or average of the two middle elements. First and last element is just row[0] and row[-1].

Related

how to solve concatenate issue with.cell()? row = row work, column = column gives error

I am looping through an excel sheet, looking for a specific name. When found, I print the position of the cell and the value.
I would like to find the position and value of a neighbouring cell, however I can't get .cell() to work by adding 2, indicating I would like the cell 2 columns away in the same row.
row= row works, but column= column gives error, and column + 2 gives error. Maybe this is due to me listing columns as 'ABCDEFGHIJ' earlier in my code? (For full code, see below)
print 'Cell position {} has value {}'.format(cell_name, currentSheet[cell_name].value)
print 'Cell position next door TEST {}'.format(currentSheet.cell(row=row, column=column +2))
Full code:
file = openpyxl.load_workbook('test6.xlsx', read_only = True)
allSheetNames = file.sheetnames
#print("All sheet names {}" .format(file.sheetnames))
for sheet in allSheetNames:
print('Current sheet name is {}'.format(sheet))
currentSheet = file[sheet]
for row in range(1, currentSheet.max_row + 1):
#print row
for column in 'ABCDEFGHIJ':
cell_name = '{}{}'.format(column,row)
if currentSheet[cell_name].value == 'sign_name':
print 'Cell position {} has value {}'.format(cell_name, currentSheet[cell_name].value)
print 'Cell position TEST {}'.format(currentSheet.cell(row=row, column=column +2))
I get this output:
Current sheet name is Sheet1
Current sheet name is Sheet2
Cell position D5 has value sign_name
and:
TypeError: cannot concatenate 'str' and 'int' objects
I get the same error if I try "column = column" as "column = column +2".
Why does row=row work, but column=column dosen't? And how to find the cell name of the cell to the right of my resulting D5 cell?
The reason row=row works and column=column doesn't is because your column value is a string (letter from A to J) while the column argument of a cell is expecting an int (A would be 1, B would be 2, Z would be 26, etc.)
There are a few changes I would make in order to more effectively iterate through the cells and find a neighbor. Firstly, OpenPyXl offers sheet.iter_rows(), which given no arguments, will provide a generator of all rows that are used in the sheet. So you can iterate with
for row in currentSheet.iter_rows():
for cell in row:
because each row is a generator of cells in that row.
Then in this new nested for loop, you can get the current column index with cell.column (D would give 4) and the cell to the right (increment by one column) would be currentSheet.cell(row=row, column=cell.column+1)
Note the difference between the two cell's: currentSheet.cell() is a request for a specific cell while cell.column+1 is the column index of the current cell incremented by 1.
Relevant OpenPyXl documentation:
https://openpyxl.readthedocs.io/en/stable/api/openpyxl.cell.cell.html
https://openpyxl.readthedocs.io/en/stable/api/openpyxl.worksheet.worksheet.html

Copying Array elements to excel using python

I am trying to copy elements of array sequentially to excel .
Here is the code :
array = ['A','B','C','D','E']
print len(array)
for i in range(1,len(array)):
sheet2.cell(i,1).value = array[i]
#print cell
#sheet2.cell(i,1).value = cell
wb2.save(path2)
Expected :
Should sequentially write A,B,C,D,E to the rows in excel
Actual:
Starts writing from B,C,D,E
What am i missing. Something very simple
if you use-> for i in range(1,len(array)):
you'll miss the first element of array because it starts with 1. index
python arrays start with 0 index
use it like -> for i in range(len(array)): or for i in range(0,len(array)):
Python indices start with 0, excel indices apparently with 1. So simply do:
array = ['A','B','C','D','E']
print len(array)
for i in range(0,len(array)):
sheet2.cell(i+1,1).value = array[i]
print cell
sheet2.cell(i+1,1).value = cell
wb2.save(path2)
In python array starts with zero. This is the correct code:
array = ['A','B','C','D','E']
print len(array)
for i in range(0,len(array)):
sheet2.cell(i+1,1).value = array[i]
#print cell
#sheet2.cell(i+1,1).value = cell
wb2.save(path2)
Notice that the for loop starts from 0.

Using For loop on nested list

I'm using a nested list to hold data in a Cartesian coordinate type system.
The data is a list of categories which could be 0,1,2,3,4,5,255 (just 7 categories).
The data is held in a list formatted thus:
stack = [[0,1,0,0],
[2,1,0,0],
[1,1,1,3]]
Each list represents a row and each element of a row represents a data point.
I'm keen to hang on to this format because I am using it to generate images and thus far it has been extremely easy to use.
However, I have run into problems running the following code:
for j in range(len(stack)):
stack[j].append(255)
stack[j].insert(0, 255)
This is intended to iterate through each row adding a single element 255 to the start and end of each row. Unfortunately it adds 12 instances of 255 to both the start and end!
This makes no sense to me. Presumably I am missing something very trivial but I can't see what it might be. As far as I can tell it is related to the loop: if I write stack[0].append(255) outside of the loop it behaves normally.
The code is obviously part of a much larger script. The script runs multiple For loops, a couple of which are range(12) but which should have closed by the time this loop is called.
So - am I missing something trivial or is it more nefarious than that?
Edit: full code
step_size = 12, the code above is the part that inserts "right and left borders"
def classify(target_file, output_file):
import numpy
import cifar10_eval # want to hijack functions from the evaluation script
target_folder = "Binaries/" # finds target file in "Binaries"
destination_folder = "Binaries/Maps/" # destination for output file
# open the meta file to retrieve x,y dimensions
file = open(target_folder + target_file + "_meta" + ".txt", "r")
new_x = int(file.readline())
new_y = int(file.readline())
orig_x = int(file.readline())
orig_y = int(file.readline())
segment_dimension = int(file.readline())
step_size = int(file.readline())
file.close()
# run cifar10_eval and create predictions vector (formatted as a list)
predictions = cifar10_eval.map_interface(new_x * new_y)
del predictions[(new_x * new_y):] # get rid of excess predictions (that are an artefact of the fixed batch size)
print("# of predictions: " + str(len(predictions)))
# check that we are mapping the whole picture! (evaluation functions don't necessarily use the full data set)
if len(predictions) != new_x * new_y:
print("Error: number of predictions from cifar10_eval does not match metadata for this file")
return
# copy predictions to a nested list to make extraction of x/y data easy
# also eliminates need to keep metadata - x/y dimensions are stored via the shape of the output vector
stack = []
for j in range(new_y):
stack.append([])
for i in range(new_x):
stack[j].append(predictions[j*new_x + i])
predictions = None # clear the variable to free up memory
# iterate through map list and explode each category to cover more pixels
# assigns a step_size x step_size area to each classification input to achieve correspondance with original image
new_stack = []
for j in range(len(stack)):
row = stack[j]
new_row = []
for i in range(len(row)):
for a in range(step_size):
new_row.append(row[i])
for b in range(step_size):
new_stack.append(new_row)
stack = new_stack
new_stack = None
new_row = None # clear the variables to free up memory
# add a border to the image to indicate that some information has been lost
# border also ensures that map has 1-1 correspondance with original image which makes processing easier
# calculate border dimensions
top_and_left_thickness = int((segment_dimension - step_size) / 2)
right_thickness = int(top_and_left_thickness + (orig_x - (top_and_left_thickness * 2 + step_size * new_x)))
bottom_thickness = int(top_and_left_thickness + (orig_y - (top_and_left_thickness * 2 + step_size * new_y)))
print(top_and_left_thickness)
print(right_thickness)
print(bottom_thickness)
print(len(stack[0]))
# add the right then left borders
for j in range(len(stack)):
for b in range(right_thickness):
stack[j].append(255)
for b in range(top_and_left_thickness):
stack[j].insert(0, 255)
print(stack[0])
print(len(stack[0]))
# add the top and bottom borders
row = []
for i in range(len(stack[0])):
row.append(255) # create a blank row
for b in range(top_and_left_thickness):
stack.insert(0, row) # append the blank row to the top x many times
for b in range(bottom_thickness):
stack.append(row) # append the blank row to the bottom of the map
# we have our final output
# repackage this as a numpy array and save for later use
output = numpy.asarray(stack,numpy.uint8)
numpy.save(destination_folder + output_file + ".npy", output)
print("Category mapping complete, map saved as numpy pickle: " + output_file + ".npy")

Deleting duplicate x values and their corresponding y values

I am working with a list of points in python 2.7 and running some interpolations on the data. My list has over 5000 points and I have some repeating "x" values within my list. These repeating "x" values have different corresponding "y" values. I want to get rid of these repeating points so that my interpolation function will work, because if there are repeating "x" values with different "y" values it runs an error because it does not satisfy the criteria of a function. Here is a simple example of what I am trying to do:
Input:
x = [1,1,3,4,5]
y = [10,20,30,40,50]
Output:
xy = [(1,10),(3,30),(4,40),(5,50)]
The interpolation function I am using is InterpolatedUnivariateSpline(x, y)
have a variable where you store the previous X value, if it is the same as the current value then skip the current value.
For example (pseudo code, you do the python),
int previousX = -1
foreach X
{
if(x == previousX)
{/*skip*/}
else
{
InterpolatedUnivariateSpline(x, y)
previousX = x /*store the x value that will be "previous" in next iteration
}
}
i am assuming you are already iterating so you dont need the actualy python code.
A bit late but if anyone is interested, here's a solution with numpy and pandas:
import pandas as pd
import numpy as np
x = [1,1,3,4,5]
y = [10,20,30,40,50]
#convert list into numpy arrays:
array_x, array_y = np.array(x), np.array(y)
# sort x and y by x value
order = np.argsort(array_x)
xsort, ysort = array_x[order], array_y[order]
#create a dataframe and add 2 columns for your x and y data:
df = pd.DataFrame()
df['xsort'] = xsort
df['ysort'] = ysort
#create new dataframe (mean) with no duplicate x values and corresponding mean values in all other cols:
mean = df.groupby('xsort').mean()
df_x = mean.index
df_y = mean['ysort']
# poly1d to create a polynomial line from coefficient inputs:
trend = np.polyfit(df_x, df_y, 14)
trendpoly = np.poly1d(trend)
# plot polyfit line:
plt.plot(df_x, trendpoly(df_x), linestyle=':', dashes=(6, 5), linewidth='0.8',
color=colour, zorder=9, figure=[name of figure])
Also, if you just use argsort() on the values in order of x, the interpolation should work even without the having to delete the duplicate x values. Trying on my own dataset:
polyfit on its own
sorting data in order of x first, then polyfit
sorting data, delete duplicates, then polyfit
... I get the same result twice

Need help in improving the speed of my code for duplicate columns removal in Python

I have written a code to take a text file as input and print only the variants which repeat more than once. By variants I mean, chr positions in the text file.
The input file looks like this:
chr1 1048989 1048989 A G intronic C1orf159 0.16 rs4970406
chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407
chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407
chr1 1113121 1113121 G A intronic TTLL10 0.13 rs12092254
As you can see, rows 2 and 3 repeat. I'm just taking the first 3 columns and seeing if they are the same. Here, chr1 1049083 1049383 repeat in both row2 and row3. So I print out saying that there is one duplicate and it's position.
I have written the code below. Though it's doing what I want, it's quite slow. It takes me about 5 min to run on a file which have 700,000 rows. I wanted to know if there is a way to speed things up.
Thanks!
#!/usr/bin/env python
""" takes in a input file and
prints out only the variants that occur more than once """
import shlex
import collections
rows = open('variants.txt', 'r').read().split("\n")
# removing the header and storing it in a new variable
header = rows.pop()
indices = []
for row in rows:
var = shlex.split(row)
indices.append("_".join(var[0:3]))
dup_list = []
ind_tuple = collections.Counter(indices).items()
for x, y in ind_tuple:
if y>1:
dup_list.append(x)
print dup_list
print len(dup_list)
Note: In this case the entire row2 is a duplicate of row3. But this is not necessarily the case all the time. Duplicate of chr positions (first three columns) is what I'm looking for.
EDIT:
Edited the code as per the suggestion of damienfrancois. Below is my new code:
f = open('variants.txt', 'r')
indices = {}
for line in f:
row = line.rstrip()
var = shlex.split(row)
index = "_".join(var[0:3])
if indices.has_key(index):
indices[index] = indices[index] + 1
else:
indices[index] = 1
dup_pos = 0
for key, value in indices.items():
if value > 1:
dup_pos = dup_pos + 1
print dup_pos
I used, time to see how long both the code takes.
My original code:
time run remove_dup.py
14428
CPU times: user 181.75 s, sys: 2.46 s,total: 184.20 s
Wall time: 209.31 s
Code after modification:
time run remove_dup2.py
14428
CPU times: user 177.99 s, sys: 2.17 s, total: 180.16 s
Wall time: 222.76 s
I don't see any significant improvement in the time.
Some suggestions:
do not read the whole file at once ; read line by line and process it on the fly ; you'll save memory operations
let indices be a default dict and increment the value at key "_".join(var[0:3]) ; this saves the costly (guessing here, should use a profiler) collections.Counter(indices).items() step
try pypy or a python compiler
split your data in as many subsets as your computer has cores, apply the program to each subset in parallel then merge the results
HTH
A big time sink is probably the if..has_key() portion of the code. In my experience, try-except is a lot faster...
f = open('variants.txt', 'r')
indices = {}
for line in f:
var = line.split()
index = "_".join(var[0:3])
try:
indices[index] += 1
except KeyError:
indices[index] = 1
f.close()
dup_pos = 0
for key, value in indices.items():
if value > 1:
dup_pos = dup_pos + 1
print dup_pos
Another option there would be replace the four try except lines with:
indices[index] = 1 + indices.get(index,0)
This approach only tells how many lines of the lines are duplicated, and not how many times they are repeated. (So if one line is duped 3x, then it will say one...)
If you are only trying to count the duplicates and not delete or note them, you could tally the lines of the file as you go, and compare this to the length of the indices dictionary, and the difference is the number of dupe lines (instead of looping back through and re-counting). This might save a little time, but gives a different answer:
#!/usr/bin/env python
f = open('variants.txt', 'r')
indices = {}
total_len=0
for line in f:
total_len +=1
var = line.split()
index = "_".join(var[0:3])
indices[index] = 1 + indices.get(index,0)
f.close()
print "Number of duplicated lines:", total_len - len(indices.keys())
I'd be curious to hear what your benchmarks are for code that does not include the has_key() test...