How the TF-IDF values are transformed - python-2.7

I am new to NLP. Please clarify how the TF-IDF values are transformed by fit_transform.
The formula below for calculating the (smoothed) IDF works fine:
log((total number of documents + 1) / (number of documents containing the term + 1)) + 1
E.g. the IDF value for the term "This" in document 1 ("this is a string") is 1.51082562.
After applying fit_transform, the values for all the terms change. What is the formula/logic used for the transformation?
TFIDF = TF * IDF
E.g. the TF-IDF value for the term "This" in document 1 ("this is a string") is 0.61366674.
How is this value, 0.61366674, arrived at?
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
d = pd.Series(['This is a string','This is another string',
'TFIDF Computation Calculation','TFIDF is the product of TF and IDF'])
df = pd.DataFrame(d)
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(df[0])
print (tfidf_vectorizer.idf_)
#output
#[1.91629073 1.91629073 1.91629073 1.91629073 1.91629073 1.22314355 1.91629073
#1.91629073 1.51082562 1.91629073 1.51082562 1.91629073 1.51082562]
##-------------------------------------------------
##how the above values are getting transformed here
##-------------------------------------------------
print (tfidf.toarray())
#[[0. 0. 0. 0. 0. 0.49681612 0.
#0. 0.61366674 0. 0. 0. 0.61366674]
# [0. 0.61422608 0. 0. 0. 0.39205255
# 0. 0. 0.4842629 0. 0. 0. 0.4842629 ]
# [0. 0. 0.61761437 0.61761437 0. 0.
# 0. 0. 0. 0. 0.48693426 0. 0. ]
# [0.37718389 0. 0. 0. 0.37718389 0.24075159
# 0.37718389 0.37718389 0. 0.37718389 0.29737611 0.37718389 0. ]]

These are normed TF-IDF vectors, because by default norm='l2' according to the documentation. So in the output of tfidf.toarray(), each row (level 0 of the array) represents a document and each column (level 1) represents a unique word, with the sum of squares of the vector elements of each document being equal to 1. You can check this with print([sum(word ** 2 for word in doc) for doc in tfidf.toarray()]).
norm : ‘l1’, ‘l2’ or None, optional (default=’l2’)
Each output row will have unit norm, either:
* ‘l2’: Sum of squares of vector elements is 1. The cosine similarity between two vectors is their dot product when l2 norm has been applied.
* ‘l1’: Sum of absolute values of vector elements is 1.
See preprocessing.normalize
print(tfidf) #the same values you find in tfidf.toarray() but more readable
output: ([index of document on array lvl 0 / row], [index of unique word on array lvl 1 / column]) normed TF-IDF value
(0, 12) 0.6136667440107333 #1st word in 1st sentence: 'This'
(0, 5) 0.4968161174826459 #'is'
(0, 8) 0.6136667440107333 #'string', see that word 'a' is missing
(1, 12) 0.48426290003607125 #'This'
(1, 5) 0.3920525532545391 #'is'
(1, 8) 0.48426290003607125 #'string'
(1, 1) 0.6142260844216119 #'another'
(2, 10) 0.48693426407352264 #'TFIDF'
(2, 3) 0.6176143709756019 #'Computation'
(2, 2) 0.6176143709756019 #'Calculation'
(3, 5) 0.2407515909314943 #'is'
(3, 10) 0.2973761110467491 #'TFIDF'
(3, 11) 0.37718388973255157 #'the'
(3, 7) 0.37718388973255157 #'product'
(3, 6) 0.37718388973255157 #'of'
(3, 9) 0.37718388973255157 #'TF'
(3, 0) 0.37718388973255157 #'and'
(3, 4) 0.37718388973255157 #'IDF'
Because these are normed TF-IDF values, the sum of squares of the vector elements is equal to 1. E.g. for the first document (index 0): sum([0.6136667440107333 ** 2, 0.4968161174826459 ** 2, 0.6136667440107333 ** 2]) ≈ 1.0
You can turn off this transformation by setting norm=None.
print(TfidfVectorizer(norm=None).fit_transform(df[0])) #the same values you find in TfidfVectorizer(norm=None).fit_transform(df[0]).toarray(), but more readable
output: ([index of document on array lvl 0 / row], [index of unique word on array lvl 1 / column]) TF-IDF value
(0, 12) 1.5108256237659907 #1st word in 1st sentence: 'This'
(0, 5) 1.2231435513142097 #'is'
(0, 8) 1.5108256237659907 #'string', see that word 'a' is missing
(1, 12) 1.5108256237659907 #'This'
(1, 5) 1.2231435513142097 #'is'
(1, 8) 1.5108256237659907 #'string'
(1, 1) 1.916290731874155 #'another'
(2, 10) 1.5108256237659907 #'TFIDF'
(2, 3) 1.916290731874155 #'Computation'
(2, 2) 1.916290731874155 #'Calculation'
(3, 5) 1.2231435513142097 #'is'
(3, 10) 1.5108256237659907 #'TFIDF'
(3, 11) 1.916290731874155 #'the'
(3, 7) 1.916290731874155 #'product'
(3, 6) 1.916290731874155 #'of'
(3, 9) 1.916290731874155 #'TF'
(3, 0) 1.916290731874155 #'and'
(3, 4) 1.916290731874155 #'IDF'
Because every word appears just once in each document, the TF-IDF values are simply the IDF values of each word times 1:
tfidf_vectorizer = TfidfVectorizer(norm=None)
tfidf = tfidf_vectorizer.fit_transform(df[0])
print(tfidf_vectorizer.idf_)
output: Smoothed IDF-values
[1.91629073 1.91629073 1.91629073 1.91629073 1.91629073 1.22314355
1.91629073 1.91629073 1.51082562 1.91629073 1.51082562 1.91629073
1.51082562]
I hope the above is helpful to you.
The transformation itself is plain l2 normalization: every raw TF-IDF value is divided by the Euclidean norm (the square root of the sum of squares) of its document's vector. Because of this, the normed values are influenced by the number of words in each document; if you want the raw TF * IDF products instead, turn the normalization off with norm=None. You can reproduce the normed output from the raw values like this:
tfidf_raw = TfidfVectorizer(norm=None).fit_transform(df[0]).toarray()
tfidf_norm_calculated = [
    [word / sum(w ** 2 for w in doc) ** 0.5 for word in doc]
    for doc in tfidf_raw]
print(tfidf_norm_calculated)
print('Sum of squares of vector elements is 1: ', [sum(word ** 2 for word in doc) for doc in tfidf_norm_calculated])
print('Compare to:', TfidfVectorizer().fit_transform(df[0]).toarray())

Related

How can I divide a list into subsets which sum up to a given number (without repetition)?

from given list of numbers
nums=[4,3,2,3,5,2,1]
from itertools import combinations
nums=[4,3,2,3,5,2,1]
li=[]
for i in range(1,len(nums)):
    comb=combinations(nums,i)
    for j in comb:
        if sum(j)==5:
            li.append(j)
print(li)
and output is
[(5,), (4, 1), (3, 2), (3, 2), (2, 3), (3, 2), (2, 2, 1)]
I am able to find the subsets, but the elements seem to be repeated,
so I am interested in non-repeating elements.
I want the list of subsets whose sums equal 5
(without reusing elements),
for example: [(5,), (1, 4), (2, 3), (2, 3)]
If you change the loop slightly so that used numbers are removed from the list, they aren't reused in another sum, e.g.
i = 1
while i <= len(nums):
    comb = combinations(nums, i)
    for j in comb:
        if sum(j) == 5:
            li.append(j)
            for n in j:
                nums.remove(n)
            break
    else:
        i += 1  # increment only if nothing found
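For reference, here is the modified loop as one self-contained script. Note that it is greedy: it consumes the first matching combination it finds, so for some inputs it may miss groupings that a full partitioning search would find.

```python
from itertools import combinations

nums = [4, 3, 2, 3, 5, 2, 1]
target = 5
li = []
i = 1
while i <= len(nums):
    comb = combinations(nums, i)
    for j in comb:
        if sum(j) == target:
            li.append(j)
            for n in j:
                nums.remove(n)   # consume the used numbers
            break                # restart the search on the shrunken list
    else:
        i += 1  # no combination of size i sums to target; try bigger subsets
print(li)  # [(5,), (4, 1), (3, 2), (3, 2)]
```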

How to calculate distances from coordinates stored in lists

So far I managed to calculate the distances between a point P(x,y) and a multitude of points stored in a list l = [(x1,y1), (x2,y2), (x3,y3), ...]. Here is the code:
import math
import pprint
l = [(1,2), (2,3), (4,5)]
p = (3,3)
dists = [math.sqrt((p[0]-l0)**2 + (p[1]-l1)**2) for l0, l1 in l]
pprint.pprint(dists)
Output :
[2.23606797749979, 1.0, 2.23606797749979]
Now I want to calculate the distances from a multitude of points in a new list to the points in the list l.
I haven't found a solution yet, so does anyone have an idea how this could be done?
Here is a possible solution:
from math import sqrt
def distance(p1, p2):
    return sqrt((p1[0]-p2[0])**2 + (p1[1]-p2[1])**2)

lst1 = [(1,2), (2,3), (4,5)]
lst2 = [(6,7), (8,9), (10,11)]

for p1 in lst1:
    for p2 in lst2:
        d = distance(p1, p2)
        print(f'Distance between {p1} and {p2}: {d}')
Output:
Distance between (1, 2) and (6, 7): 7.0710678118654755
Distance between (1, 2) and (8, 9): 9.899494936611665
Distance between (1, 2) and (10, 11): 12.727922061357855
Distance between (2, 3) and (6, 7): 5.656854249492381
Distance between (2, 3) and (8, 9): 8.48528137423857
Distance between (2, 3) and (10, 11): 11.313708498984761
Distance between (4, 5) and (6, 7): 2.8284271247461903
Distance between (4, 5) and (8, 9): 5.656854249492381
Distance between (4, 5) and (10, 11): 8.48528137423857
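If you prefer to collect the results instead of printing them, a nested list comprehension builds the full distance matrix. This sketch assumes Python 3.8+, where math.dist computes the Euclidean distance between two points directly:

```python
from math import dist  # Python 3.8+

lst1 = [(1, 2), (2, 3), (4, 5)]
lst2 = [(6, 7), (8, 9), (10, 11)]

# matrix[i][j] = distance between lst1[i] and lst2[j]
matrix = [[dist(p1, p2) for p2 in lst2] for p1 in lst1]
print(matrix[0][0])  # 7.0710678118654755
```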

Python - Compare Tuples in a List

So in a program I am creating I have a list that contains tuples, and each tuple contains 3 numbers. For example...
my_list = [(1, 2, 4), (2, 4, 1), (1, 5, 2), (1, 4, 1),...]
Now I want to delete any tuple whose last two numbers are both less than the last two numbers of any other tuple.
The first numbers have to be the same for the tuple to be deleted.
So with the list of tuples above this would happen...
my_list = [(1, 2, 4), (2, 4, 1), (1, 5, 2), (1, 4, 1),...]
# some code...
result = [(1, 2, 4), (2, 4, 1), (1, 5, 2)]
The first tuple is not deleted because (2 and 4) are not less than (4 and 1 -> 2 < 4 but 4 > 1), (1 and 5 -> 2 > 1), or (4 and 1 -> 2 < 4 but 4 > 1)
The second tuple is not deleted because its first number (2) is different from every other tuple's first number.
The third tuple is not deleted for the same reason the first tuple is not deleted.
The fourth tuple is deleted because (4 and 1) is less than (5 and 2 -> 4 < 5 and 1 < 2)
I really need help because I am stuck in my program and I have no idea what to do. I'm not asking for a solution, but just some guidance as to how to even begin solving this. Thank you so much!
I think this might actually work. I just figured it out. Is this the best solution?
results = [(1, 2, 4), (2, 4, 1), (1, 5, 2), (1, 4, 1)]
for position in results:
    for check in results:
        if position[0] == check[0] and position[1] < check[1] and position[2] < check[2]:
            results.remove(position)
A simple list comprehension to do this:
[i for i in my_list if not any(i[0] == j[0] and i[1] < j[1] and i[2] < j[2] for j in my_list)]
Your loop would work too, but be sure not to modify the list as you are iterating over it.
my_list = [(1, 2, 4), (2, 4, 1), (1, 5, 2), (1, 4, 1)]
results = []
for position in my_list:
    if not any(position[0] == check[0] and position[1] < check[1] and position[2] < check[2]
               for check in my_list):
        results.append(position)
print(results)
>[(1, 2, 4), (2, 4, 1), (1, 5, 2)]

Codility MaxDistanceMonotonic: what's wrong with my solution?

Question:
A non-empty zero-indexed array A consisting of N integers is given.
A monotonic pair is a pair of integers (P, Q), such that 0 ≤ P ≤ Q < N and A[P] ≤ A[Q].
The goal is to find the monotonic pair whose indices are the furthest apart. More precisely, we should maximize the value Q − P. It is sufficient to find only the distance.
For example, consider array A such that:
A[0] = 5
A[1] = 3
A[2] = 6
A[3] = 3
A[4] = 4
A[5] = 2
There are eleven monotonic pairs: (0,0), (0, 2), (1, 1), (1, 2), (1, 3), (1, 4), (2, 2), (3, 3), (3, 4), (4, 4), (5, 5). The biggest distance is 3, in the pair (1, 4).
Write a function:
int solution(vector<int> &A);
that, given a non-empty zero-indexed array A of N integers, returns the biggest distance within any of the monotonic pairs.
For example, given:
A[0] = 5
A[1] = 3
A[2] = 6
A[3] = 3
A[4] = 4
A[5] = 2
the function should return 3, as explained above.
Assume that:
N is an integer within the range [1..300,000];
each element of array A is an integer within the range [−1,000,000,000..1,000,000,000].
Complexity:
expected worst-case time complexity is O(N);
expected worst-case space complexity is O(N), beyond input storage (not counting the storage required for input arguments).
Elements of input arrays can be modified.
Here is my solution of MaxDistanceMonotonic:
int solution(vector<int> &A) {
    long int result;
    long int max = A.size() - 1;
    long int min = 0;
    while (A.at(max) < A.at(min)) {
        max--;
        min++;
    }
    result = max - min;
    while (max < (long int)A.size()) {
        while (min >= 0) {
            if (A.at(max) >= A.at(min) && max - min > result) {
                result = max - min;
            }
            min--;
        }
        max++;
    }
    return result;
}
And my result is like this, what's wrong with my answer for the last test:
If you have:
index:  0  1  2  3  4  5
value: 31  2 10 11 12 30
Your algorithm outputs 3, but the correct answer is 4 = 5 - 1.
This happens because your min goes to -1 on the first full run of the inner while loop, so the pair (1, 5) will never have the chance to get checked, max starting out at 4 when entering the nested whiles.
Note that the problem description expects O(n) extra storage, while you use O(1). I don't think it's possible to solve the problem with O(1) extra storage and O(n) time.
I suggest you rethink your approach. If you give up, there is an official solution here.
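For comparison, here is a sketch (in Python, and not the official solution) of one standard O(N) time / O(N) space approach. The prefix minima B and suffix maxima C are both non-increasing, and whenever B[i] <= C[j] there is a monotonic pair at distance at least j - i, so a single two-pointer sweep finds the maximum:

```python
def max_distance_monotonic(A):
    n = len(A)
    B = A[:]                              # B[i] = min(A[0..i]), non-increasing
    for i in range(1, n):
        B[i] = min(B[i - 1], B[i])
    C = A[:]                              # C[j] = max(A[j..n-1]), non-increasing
    for j in range(n - 2, -1, -1):
        C[j] = max(C[j + 1], C[j])
    best, j = 0, 0
    for i in range(n):
        j = max(j, i)                     # the pair (i, i) is always valid
        while j < n and C[j] >= B[i]:     # push Q as far right as possible
            j += 1
        best = max(best, (j - 1) - i)     # j - 1 is the last valid position
    return best

print(max_distance_monotonic([5, 3, 6, 3, 4, 2]))       # 3
print(max_distance_monotonic([31, 2, 10, 11, 12, 30]))  # 4
```

Since B only decreases and C only decreases, the pointer j never needs to move backwards, which keeps the sweep linear.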

How do vector applications skew polygons?

I know how to move, rotate, and scale, but how does skewing work? What would I have to do to a set of vertices to skew them?
Thanks
Offset X values by an amount that varies linearly with the Y value (or vice versa).
Edit: Doing this with a rectangle:
Let's say you start with a rectangle (0, 0), (4, 0), (4, 4), (0, 4). Let's assume you want to skew it with a slope of 2, so as it goes two units up, it'll move one to the right, something like this (hand drawn, so the angle's undoubtedly a bit wrong, but I hope it gives the general idea):
To get this, each X value is adjusted like:
X = X + Y * S
where S is the inverse of the slope of the skew. In this case, the slope is 2, so S = 1/2. Working that for our four corners, we get:
(0, 0) => 0 + 0 / 2 = 0 => (0, 0)
(4, 0) => 4 + 0 / 2 = 4 => (4, 0)
(4, 4) => 4 + 4 / 2 = 6 => (6, 4)
(0, 4) => 0 + 4 / 2 = 2 => (2, 4)
Skewing / shearing is described in detail at http://en.wikipedia.org/wiki/Shear_mapping and http://mathworld.wolfram.com/ShearMatrix.html
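As a concrete sketch of the vertex transformation described above (a hypothetical helper for an x-axis shear only):

```python
def skew_x(vertices, s):
    # x' = x + y * s, where s is the inverse of the skew slope; y is unchanged
    return [(x + y * s, y) for x, y in vertices]

rect = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(skew_x(rect, 0.5))  # [(0.0, 0), (4.0, 0), (6.0, 4), (2.0, 4)]
```

This reproduces the four corners worked out above; shearing along y instead would offset the y values by x * s.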