Fast indexing of symmetrical matrix - fortran

In a large code written in Fortran 2008 for calculating thermodynamic equilibria and phase diagrams, I use many symmetric matrices, which I store as 1D arrays and index using a small function:
integer function ixsym(i,j)
  if(i.gt.j) then
    ixsym=j+i*(i-1)/2
  else
    ixsym=i+j*(j-1)/2
  endif
  return
end
This works perfectly, but after improving the speed of various other parts of the code this routine now takes 15-20% of the calculation time (it is called very often). I assume there are various ways of speeding this up, but I do not know C or any other way to replace this function, so I am looking for help. I use gfortran, but the replacement has to be portable.
Bo Sundman

The only thing you might consider is to get rid of the branching in that function:
The minimum and maximum of two numbers can be computed as:
max = (a+b + abs(a-b))/2
min = (a+b - abs(a-b))/2 = a+b - max
So you can use this as
integer function ixsym(i,j)
  integer :: p, q
  q = i+j; p = (q + abs(i-j))/2; q = q - p
  ixsym = q + (p*(p-1))/2
  return
end
which you can further reduce as
integer function ixsym(i,j)
  integer :: p
  ixsym = i+j; p = (ixsym + abs(i-j))/2
  ixsym = ixsym + (p*(p-3))/2
  return
end
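As a quick sanity check of the algebra above (not of the speed), here is a small Python sketch that compares the original branching function with the two branch-free variants for 1-based indices. The function names and the test size n are made up for this check. Python's // floors while Fortran integer division truncates, but both agree here because p*(p-1) and p*(p-3) are always even.

```python
def ixsym_branch(i, j):
    """Original branching version (1-based indices)."""
    if i > j:
        return j + i * (i - 1) // 2
    return i + j * (j - 1) // 2


def ixsym_minmax(i, j):
    """Branch-free version using max = (a+b+|a-b|)/2 and min = a+b-max."""
    q = i + j
    p = (q + abs(i - j)) // 2    # p = max(i, j)
    q = q - p                    # q = min(i, j)
    return q + p * (p - 1) // 2


def ixsym_reduced(i, j):
    """Reduced version: min + p(p-1)/2 rewritten as (i+j) + p(p-3)/2."""
    s = i + j
    p = (s + abs(i - j)) // 2    # p = max(i, j)
    return s + p * (p - 3) // 2  # p*(p-3) is always even, so // is exact


n = 50  # hypothetical matrix dimension, used only for this check
all_agree = all(
    ixsym_branch(i, j) == ixsym_minmax(i, j) == ixsym_reduced(i, j)
    for i in range(1, n + 1)
    for j in range(1, n + 1)
)
print(all_agree)
```

Whether the branch-free form is actually faster in Fortran depends on the compiler and CPU, so it should be timed in the real code.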

Fortran compilers used to have optimization on par or better than C compilers.
So I would not expect a gain just by switching language and rather focus on algorithmic improvements.
How about replacing the calculation of the index transformation by a lookup table?
Do you have the memory to store the ixsym values for given i and j indices?
Yes, it runs counter to your memory-for-CPU trade-off, but if you have many matrices this one extra table might help.
Is it really necessary to calculate the transformation every time? E.g. if you iterate over elements: ixsym(i+1, j) = ixsym(i, j) + 1, if i < j.
Another idea, though hardware specific, might be to order your data differently, so that it stays within cache areas of the CPU.
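To illustrate the point about iterating: with 1-based indices and i <= j, walking the upper triangle column by column visits the packed array in storage order, so the packed index can be a running counter instead of a function call. A minimal Python sketch follows (packed_indices is a made-up helper name; the same loop structure carries over to Fortran):

```python
def ixsym(i, j):
    """Reference index function from the question (1-based indices)."""
    if i > j:
        return j + i * (i - 1) // 2
    return i + j * (j - 1) // 2


def packed_indices(n):
    """Yield (i, j, k) for 1 <= i <= j <= n, where k is a running counter."""
    k = 0
    for j in range(1, n + 1):
        for i in range(1, j + 1):
            k += 1  # next element in packed storage, no index arithmetic
            yield i, j, k


# The running counter matches the computed index for every element.
counter_matches = all(k == ixsym(i, j) for i, j, k in packed_indices(20))
print(counter_matches)
```

This only helps in loops that sweep the matrix in order. Random accesses still need the index function or a lookup table.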
About your index transformation:
I initially thought you used some variation of the Cantor pairing function to enumerate your symmetric 2D array. I asked my friend Ruby to plot a few pairs and she told me:
(0, 0) -> 0 (0, 1) -> 0 (0, 2) -> 1 (0, 3) -> 3 (0, 4) -> 6
(1, 0) -> 0 (1, 1) -> 1 (1, 2) -> 2 (1, 3) -> 4 (1, 4) -> 7
(2, 0) -> 1 (2, 1) -> 2 (2, 2) -> 3 (2, 3) -> 5 (2, 4) -> 8
(3, 0) -> 3 (3, 1) -> 4 (3, 2) -> 5 (3, 3) -> 6 (3, 4) -> 9
(4, 0) -> 6 (4, 1) -> 7 (4, 2) -> 8 (4, 3) -> 9 (4, 4) -> 10
I would have expected only two occurrences of a calculated index, but I see three for some pairs. Is this intended?
Update:
It was the index start, as fellow user Jean-Claude Arbaut pointed out in his comment.
Here is Ruby's answer with indices starting at 1:
(1, 1) -> 1 (1, 2) -> 2 (1, 3) -> 4 (1, 4) -> 7 (1, 5) -> 11
(2, 1) -> 2 (2, 2) -> 3 (2, 3) -> 5 (2, 4) -> 8 (2, 5) -> 12
(3, 1) -> 4 (3, 2) -> 5 (3, 3) -> 6 (3, 4) -> 9 (3, 5) -> 13
(4, 1) -> 7 (4, 2) -> 8 (4, 3) -> 9 (4, 4) -> 10 (4, 5) -> 14
(5, 1) -> 11 (5, 2) -> 12 (5, 3) -> 13 (5, 4) -> 14 (5, 5) -> 15


How can I divide a list into subsets that sum up to a given number (without reusing elements)?

Given the list of numbers nums=[4,3,2,3,5,2,1], I tried:
from itertools import combinations
nums=[4,3,2,3,5,2,1]
li=[]
for i in range(1,len(nums)):
    comb=combinations(nums,i)
    for j in comb:
        if sum(j)==5:
            li.append(j)
print(li)
and the output is
[(5,), (4, 1), (3, 2), (3, 2), (2, 3), (3, 2), (2, 2, 1)]
I am able to find the subsets, but the elements seem to be reused across subsets.
I want the list of subsets whose sum equals 5, with each element of the list used at most once.
Example: [(5), (1, 4), (2,3), (2,3)]
If you change the loop slightly so that used numbers are removed from the list, they aren't reused in another sum, e.g.
i = 1
while i <= len(nums):
    comb = combinations(nums, i)
    for j in comb:
        if sum(j) == 5:
            li.append(j)
            for n in j: nums.remove(n)
            break
    else: i += 1 # increment only if nothing found
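The same idea can be wrapped in a self-contained function. This is only a sketch: disjoint_subsets and its parameter names are made up here, and the approach is greedy (it takes the first matching combination it finds), so it is not guaranteed to find the largest possible number of subsets for every input.

```python
from itertools import combinations


def disjoint_subsets(nums, target):
    """Greedily collect subsets summing to target, using each element at most once."""
    remaining = list(nums)  # work on a copy so the caller's list is untouched
    found = []
    size = 1
    while size <= len(remaining):
        for combo in combinations(remaining, size):
            if sum(combo) == target:
                found.append(combo)
                for n in combo:
                    remaining.remove(n)  # these elements are now used up
                break  # retry the same size on the shortened list
        else:
            size += 1  # no subset of this size left, try larger ones
    return found


result = disjoint_subsets([4, 3, 2, 3, 5, 2, 1], 5)
print(result)  # [(5,), (4, 1), (3, 2), (3, 2)]
```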

How to calculate distances from coordinates stored in lists

So far I have managed to calculate the distances between a point P(x,y) and a multitude of points stored in a list l = [(x1,y1), (x2,y2), (x3,y3), ...]. Here is the code:
import math
import pprint
l = [(1,2), (2,3), (4,5)]
p = (3,3)
dists = [math.sqrt((p[0]-l0)**2 + (p[1]-l1)**2) for l0, l1 in l]
pprint.pprint(dists)
Output :
[2.23606797749979, 1.0, 2.23606797749979]
Now I want to calculate the distances from each of several points in a new list to the points in the list l.
I haven't found a solution yet, so does anyone have an idea how this could be done?
Here is a possible solution:
from math import sqrt
def distance(p1, p2):
    return sqrt((p1[0]-p2[0])**2 + (p1[1]-p2[1])**2)
lst1 = [(1,2), (2,3), (4,5)]
lst2 = [(6,7), (8,9), (10,11)]
for p1 in lst1:
    for p2 in lst2:
        d = distance(p1, p2)
        print(f'Distance between {p1} and {p2}: {d}')
Output:
Distance between (1, 2) and (6, 7): 7.0710678118654755
Distance between (1, 2) and (8, 9): 9.899494936611665
Distance between (1, 2) and (10, 11): 12.727922061357855
Distance between (2, 3) and (6, 7): 5.656854249492381
Distance between (2, 3) and (8, 9): 8.48528137423857
Distance between (2, 3) and (10, 11): 11.313708498984761
Distance between (4, 5) and (6, 7): 2.8284271247461903
Distance between (4, 5) and (8, 9): 5.656854249492381
Distance between (4, 5) and (10, 11): 8.48528137423857
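If the distances are needed for further computation rather than just printed, one option is to collect them in a nested list, one row per point of the first list. The sketch below assumes Python 3.8 or newer, where math.dist is available. The name dist_matrix is made up for this example.

```python
import math

lst1 = [(1, 2), (2, 3), (4, 5)]
lst2 = [(6, 7), (8, 9), (10, 11)]

# dist_matrix[a][b] is the Euclidean distance from lst1[a] to lst2[b]
dist_matrix = [[math.dist(p1, p2) for p2 in lst2] for p1 in lst1]

for p1, row in zip(lst1, dist_matrix):
    print(p1, row)
```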

how the TFIDF values are transformed

I am new to NLP; please clarify how the TFIDF values are transformed by fit_transform.
The formula below for calculating the IDF works fine:
log((total number of documents + 1) / (number of documents containing the term + 1)) + 1
E.g. the IDF value for the term "This" in document 1 ("this is a string") is 1.51082562.
After applying fit_transform, the values for all the terms are changed. What is the formula/logic used for the transformation?
TFIDF = TF * IDF
E.g. the TFIDF value for the term "This" in document 1 ("this is a string") is 0.61366674.
How is this value, 0.61366674, arrived at?
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
d = pd.Series(['This is a string','This is another string',
'TFIDF Computation Calculation','TFIDF is the product of TF and IDF'])
df = pd.DataFrame(d)
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(df[0])
print (tfidf_vectorizer.idf_)
#output
#[1.91629073 1.91629073 1.91629073 1.91629073 1.91629073 1.22314355 1.91629073
#1.91629073 1.51082562 1.91629073 1.51082562 1.91629073 1.51082562]
##-------------------------------------------------
##how the above values are getting transformed here
##-------------------------------------------------
print (tfidf.toarray())
#[[0. 0. 0. 0. 0. 0.49681612 0.
#0. 0.61366674 0. 0. 0. 0.61366674]
# [0. 0.61422608 0. 0. 0. 0.39205255
# 0. 0. 0.4842629 0. 0. 0. 0.4842629 ]
# [0. 0. 0.61761437 0.61761437 0. 0.
# 0. 0. 0. 0. 0.48693426 0. 0. ]
# [0.37718389 0. 0. 0. 0.37718389 0.24075159
# 0.37718389 0.37718389 0. 0.37718389 0.29737611 0.37718389 0. ]]
These are normalized TF-IDF vectors, because by default norm='l2' according to the documentation. In the output of tfidf.toarray(), each row (level 0) of the array represents a document and each column (level 1) represents a unique word. The sum of squares of the vector elements for each document equals 1, which you can check with print([sum([word ** 2 for word in doc]) for doc in tfidf.toarray()]). From the documentation:
norm : 'l1', 'l2' or None, optional (default='l2')
Each output row will have unit norm, either:
* 'l2': Sum of squares of vector elements is 1. The cosine similarity between two vectors is their dot product when l2 norm has been applied.
* 'l1': Sum of absolute values of vector elements is 1. See preprocessing.normalize
print(tfidf) #the same values you find in tfidf.toarray() but more readable
output: ([index of document on array lvl 0 / row], [index of unique word on array lvl 1 / column]) normed TF-IDF value
(0, 12) 0.6136667440107333 #1st word in 1st sentence: 'This'
(0, 5) 0.4968161174826459 #'is'
(0, 8) 0.6136667440107333 #'string', see that word 'a' is missing
(1, 12) 0.48426290003607125 #'This'
(1, 5) 0.3920525532545391 #'is'
(1, 8) 0.48426290003607125 #'string'
(1, 1) 0.6142260844216119 #'another'
(2, 10) 0.48693426407352264 #'TFIDF'
(2, 3) 0.6176143709756019 #'Computation'
(2, 2) 0.6176143709756019 #'Calculation'
(3, 5) 0.2407515909314943 #'is'
(3, 10) 0.2973761110467491 #'TFIDF'
(3, 11) 0.37718388973255157 #'the'
(3, 7) 0.37718388973255157 #'product'
(3, 6) 0.37718388973255157 #'of'
(3, 9) 0.37718388973255157 #'TF'
(3, 0) 0.37718388973255157 #'and'
(3, 4) 0.37718388973255157 #'IDF'
Because these are normalized TF-IDF values, the sum of squares of the vector elements will be equal to 1. E.g. for the first document at index 0, the sum of squares of vector elements will be equal to 1: sum([0.6136667440107333 ** 2, 0.4968161174826459 ** 2, 0.6136667440107333 ** 2])
You can turn off this transformation by setting norm=None.
print(TfidfVectorizer(norm=None).fit_transform(df[0])) #the same values you find in TfidfVectorizer(norm=None).fit_transform(df[0]).toarray(), but more readable
output: ([index of document on array lvl 0 / row], [index of unique word on array lvl 1 / column]) TF-IDF value
(0, 12) 1.5108256237659907 #1st word in 1st sentence: 'This'
(0, 5) 1.2231435513142097 #'is'
(0, 8) 1.5108256237659907 #'string', see that word 'a' is missing
(1, 12) 1.5108256237659907 #'This'
(1, 5) 1.2231435513142097 #'is'
(1, 8) 1.5108256237659907 #'string'
(1, 1) 1.916290731874155 #'another'
(2, 10) 1.5108256237659907 #'TFIDF'
(2, 3) 1.916290731874155 #'Computation'
(2, 2) 1.916290731874155 #'Calculation'
(3, 5) 1.2231435513142097 #'is'
(3, 10) 1.5108256237659907 #'TFIDF'
(3, 11) 1.916290731874155 #'the'
(3, 7) 1.916290731874155 #'product'
(3, 6) 1.916290731874155 #'of'
(3, 9) 1.916290731874155 #'TF'
(3, 0) 1.916290731874155 #'and'
(3, 4) 1.916290731874155 #'IDF'
Because every word just appears once in each document, the TF-IDF values are the IDF values of each word times 1:
tfidf_vectorizer = TfidfVectorizer(norm=None)
tfidf = tfidf_vectorizer.fit_transform(df[0])
print(tfidf_vectorizer.idf_)
output: Smoothed IDF-values
[1.91629073 1.91629073 1.91629073 1.91629073 1.91629073 1.22314355
1.91629073 1.91629073 1.51082562 1.91629073 1.51082562 1.91629073
1.51082562]
I hope the above is helpful to you.
Unfortunately, I cannot reproduce the transformation, because
The cosine similarity between two vectors is their dot product when l2
norm has been applied.
seems to be an additional step. Because the TF-IDF values will be biased by the number of words in each document when you use the default setting norm='l2', I would simply turn this setting off by using norm=None. I found that you cannot simply do the transformation by using:
tfidf_norm_calculated = [
[(word/sum(doc))**0.5 for word in doc]
for doc in TfidfVectorizer(norm=None).fit_transform(df[0]).toarray()]
print(tfidf_norm_calculated)
print('Sum of squares of vector elements is 1: ', [sum([word**2 for word in doc]) for doc in tfidf_norm_calculated])
print('Compare to:', TfidfVectorizer().fit_transform(df[0]).toarray())
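For comparison, the default norm='l2' step appears to amount to dividing each document's raw TF-IDF vector by its Euclidean length (the square root of the sum of squares). Below is a minimal pure-Python sketch for document 1, assuming the smoothed IDF formula from the question and TF = 1 for every term. The document frequencies are counted by hand from the four example sentences, and the names are made up for this example.

```python
import math

n_docs = 4
# Number of documents containing each term kept from "This is a string"
# ('a' is dropped by the default tokenizer).
doc_freq = {"this": 2, "is": 3, "string": 2}


def smoothed_idf(df, n):
    """Smoothed IDF as in the question: log((n + 1) / (df + 1)) + 1."""
    return math.log((n + 1) / (df + 1)) + 1


# Raw TF-IDF: every term occurs once in the document, so TF = 1.
raw = {term: 1 * smoothed_idf(df, n_docs) for term, df in doc_freq.items()}

# L2 normalization: divide each value by the Euclidean length of the vector.
l2_norm = math.sqrt(sum(v * v for v in raw.values()))
normed = {term: v / l2_norm for term, v in raw.items()}

print(raw)     # 'this' and 'string' come out near 1.5108, 'is' near 1.2231
print(normed)  # 'this' and 'string' come out near 0.6137, 'is' near 0.4968
```

These agree with the first row of tfidf.toarray() shown above, which suggests that no further step is involved.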

Python - Compare Tuples in a List

So in a program I am creating I have a list that contains tuples, and each tuple contains 3 numbers. For example...
my_list = [(1, 2, 4), (2, 4, 1), (1, 5, 2), (1, 4, 1),...]
Now I want to delete any tuple whose last two numbers are both less than another tuple's last two numbers.
The first number has to be the same for the tuple to be deleted.
So with the list of tuples above this would happen...
my_list = [(1, 2, 4), (2, 4, 1), (1, 5, 2), (1, 4, 1),...]
# some code...
result = [(1, 2, 4), (2, 4, 1), (1, 5, 2)]
The first tuple is not deleted because its last two numbers (2 and 4) are not both less than those of any other tuple: compared with (2, 4, 1), 2 < 4 but 4 > 1; compared with (1, 5, 2), 2 < 5 but 4 > 2; compared with (1, 4, 1), 2 < 4 but 4 > 1.
The second tuple is not deleted because its first number (2) is different from every other tuple's first number.
The third tuple is not deleted for the same reason the first tuple is not deleted.
The fourth tuple is deleted because (4 and 1) is less than (5 and 2): 4 < 5 and 1 < 2.
I really need help because I am stuck in my program and I have no idea what to do. I'm not asking for a solution, but just some guidance as to how to even begin solving this. Thank you so much!
I think this might actually work. I just figured it out. Is this the best solution?
results = [(1, 2, 4), (2, 4, 1), (1, 5, 2), (1, 4, 1)]
for position in results:
    for check in results:
        if position[0] == check[0] and position[1] < check[1] and position[2] < check[2]:
            results.remove(position)
Simple list comprehension to do this:
[i for i in my_list if not any([i[0]==j[0] and i[1]<j[1] and i[2]<j[2] for j in my_list])]
Your loop would work too, but be sure not to modify the list as you are iterating over it. Build a new list instead, and keep a tuple only if no other tuple beats it:
my_list = [(1, 2, 4), (2, 4, 1), (1, 5, 2), (1, 4, 1)]
results = []
for position in my_list:
    for check in my_list:
        if position[0] == check[0] and position[1] < check[1] and position[2] < check[2]:
            break  # position is beaten by check, so drop it
    else:
        results.append(position)  # no tuple beats position, so keep it
results
>[(1, 2, 4), (2, 4, 1), (1, 5, 2)]

How do vector applications skew polygons?

I know how to move, rotate, and scale, but how does skewing work? What would I have to do to a set of vertices to skew them?
Thanks
Offset X values by an amount that varies linearly with the Y value (or vice versa).
Edit: Doing this with a rectangle:
Let's say you start with a rectangle (0, 0), (4, 0), (4, 4), (0, 4). Let's assume you want to skew it with a slope of 2, so as it goes two units up, it'll move one to the right, something like this (hand drawn, so the angle's undoubtedly a bit wrong, but I hope it gives the general idea):
To get this, each X value is adjusted like:
X = X + Y * S
where S is the inverse of the slope of the skew. In this case, the slope is 2, so S = 1/2. Working that for our four corners, we get:
(0, 0) => 0 + 0 / 2 = 0 => (0, 0)
(4, 0) => 4 + 0 / 2 = 4 => (4, 0)
(4, 4) => 4 + 4 / 2 = 6 => (6, 4)
(0, 4) => 0 + 4 / 2 = 2 => (2, 4)
Skewing / shearing is described in detail at http://en.wikipedia.org/wiki/Shear_mapping and http://mathworld.wolfram.com/ShearMatrix.html
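The worked example above can be written as a short Python sketch. The function name skew_x and its parameter names are made up here. It applies X = X + Y * S to every vertex and leaves Y unchanged.

```python
def skew_x(vertices, s):
    """Horizontal skew: shift each x by y * s, leave y unchanged."""
    return [(x + y * s, y) for x, y in vertices]


rect = [(0, 0), (4, 0), (4, 4), (0, 4)]
skewed = skew_x(rect, 0.5)  # slope 2, so S = 1/2
print(skewed)
```

Swapping the roles of x and y in the same way gives a vertical skew.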