I have the following data frame df, which I converted from an SFrame:
URI name text
0 <http://dbpedia.org/resource/Digby_M... Digby Morrell digby morrell born 10 october 1979 i...
1 <http://dbpedia.org/resource/Alfred_... Alfred J. Lewy alfred j lewy aka sandy lewy graduat...
2 <http://dbpedia.org/resource/Harpdog... Harpdog Brown harpdog brown is a singer and harmon...
3 <http://dbpedia.org/resource/Franz_R... Franz Rottensteiner franz rottensteiner born in waidmann...
4 <http://dbpedia.org/resource/G-Enka> G-Enka henry krvits born 30 december 1974 i...
I have done the following:
from textblob import TextBlob as tb
import math
def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

bloblist = []
for i in range(0, df.shape[0]):
    bloblist.append(tb(df.iloc[i, 2]))

for i, blob in enumerate(bloblist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:3]:
        print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))
But this is taking a lot of time as there are 59000 documents.
Is there a better way to do it?
I am confused about this subject, but I found a few solutions on the internet that use Spark. You can have a look here:
https://www.linkedin.com/pulse/understanding-tf-idf-first-principle-computation-apache-asimadi
On the other hand, I tried the method below and didn't get bad results. Maybe you want to try it:
I have a word list. This list contains each word and its count.
I found the average of these word counts.
I selected a lower limit and an upper limit using the average value
(e.g. lower bound = average / 2 and upper bound = average * 5).
Then I created a new word list using the upper and lower bounds.
With these I got the following results:
Before normalization, word vector length: 11880
Mean: 19, lower bound: 9, upper bound: 95
After normalization, word vector length: 1595
And also cosine similarity results were better.
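In case it helps, here is a minimal sketch of that frequency-band idea, assuming a flat list of tokens per document; the /2 and *5 factors come from the bounds described above, while the function name and the remaining details are my own:

# Sketch only: keep words whose count falls between mean/2 and mean*5.
from collections import Counter

def filter_by_frequency_band(tokens, lower_factor=0.5, upper_factor=5.0):
    counts = Counter(tokens)
    mean_count = sum(counts.values()) / float(len(counts))
    lower, upper = mean_count * lower_factor, mean_count * upper_factor
    # keep only the words inside the band, with their counts
    return dict((w, c) for w, c in counts.items() if lower <= c <= upper)

Restricting the vocabulary this way is what produces the kind of 11880 to 1595 reduction reported above, and it also cuts down the number of tfidf() calls per document.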
The following seems obvious, yet it does not behave as I would expect. I want to do k-fold cross validation without using SSC packages, and thought I could just filter my data and run my own regressions on the subsets.
First I generate a variable with a random integer between 1 and 5 (5-fold cross validation), then I loop over each fold number. I want to filter the data by the fold number, but using a boolean filter fails to filter anything. Why?
Bonus: what would be the best way to capture all of the test MSEs and average them? In Python I would just make a list or a numpy array and take the average.
gen randint = floor((6-1)*runiform()+1)
recast int randint
forval b = 1(1)5 {
xtreg c.DepVar /// // training set
c.IndVar1 ///
c.IndVar2 ///
if randint !=`b' ///
, fe vce(cluster uuid)
xtreg c.DepVar /// // test set, needs to be performed with model above, not a
c.IndVar1 /// // new model...
c.IndVar2 ///
if randint ==`b' ///
, fe vce(cluster uuid)
}
EDIT: Test set needs to be performed with model fit to training set. I changed my comment in the code to reflect this.
Ultimately, the solution to the filtering issue was that I was using a scalar in quotes to define the bounds, and I had:
replace randint = floor((`varscalar'-1)*runiform()+1)
instead of just
replace randint = floor((varscalar-1)*runiform()+1)
When and where to use the quotes in Stata is confusing to me. I cannot just use varscalar in a loop, I have to use `=varscalar', but I can for some reason use varscalar - 1 and get the expected result. Interestingly, I cannot use
replace randint = floor((`varscalar')*runiform()+1)
I suppose I should just use
replace randint = floor((`=varscalar')*runiform()+1)
So why is it ok to use the version with the minus one and without the equals sign??
The answer below is still extremely helpful and I learned much from it.
As a matter of fact, two different things are going on here that are not necessarily directly related: 1) how to filter data with a randomly generated integer value, and 2) the k-fold cross-validation procedure.
For the first one, I will leave an example below that could help you work things out using Stata, with some tools that can easily be transferred to other problems (such as matrix generation and manipulation to store the metrics). However, I would call neither your sketch of code nor my example "k-fold cross-validation", mainly because they fit a separate model on the testing and on the training data. Strictly speaking, the model should be trained on the training data only, and those parameters should then be used to assess its performance on the testing data.
For further reference on the procedure, scikit-learn has done brilliant work explaining it, with several visualizations included.
That being said, here is something that could be helpful.
clear all
set seed 4
set obs 100
*Simulate model
gen x1 = rnormal()
gen x2 = rnormal()
gen y = 1 + 0.5 * x1 + 1.5 *x2 + rnormal()
gen byte randint = runiformint(1, 5)
tab randint
/*
randint | Freq. Percent Cum.
------------+-----------------------------------
1 | 17 17.00 17.00
2 | 18 18.00 35.00
3 | 21 21.00 56.00
4 | 19 19.00 75.00
5 | 25 25.00 100.00
------------+-----------------------------------
Total | 100 100.00
*/
// create a matrix to store results
matrix res = J(5,4,.)
matrix colnames res = "R2_fold" "MSE_fold" "R2_hold" "MSE_hold"
matrix rownames res ="1" "2" "3" "4" "5"
// show formatted empty matrix
matrix li res
/*
res[5,4]
R2_fold MSE_fold R2_hold MSE_hold
1 . . . .
2 . . . .
3 . . . .
4 . . . .
5 . . . .
*/
// loop over different samples
forvalues b = 1/5 {
    // run the model using fold == `b'
    qui reg y x1 x2 if randint ==`b'
    // save R squared for the fold sample
    matrix res[`b', 1] = e(r2)
    // save rmse for the fold sample
    matrix res[`b', 2] = e(rmse)
    // run the model using fold != `b'
    qui reg y x1 x2 if randint !=`b'
    // save R squared for the holdout sample (?)
    matrix res[`b', 3] = e(r2)
    // save rmse for the holdout sample (?)
    matrix res[`b', 4] = e(rmse)
}
// Show matrix with stored metrics
mat li res
/*
res[5,4]
R2_fold MSE_fold R2_hold MSE_hold
1 .50949187 1.2877728 .74155365 1.0070531
2 .89942838 .71776458 .66401888 1.089422
3 .75542004 1.0870525 .68884359 1.0517139
4 .68140328 1.1103964 .71990589 1.0329239
5 .68816084 1.0017175 .71229925 1.0596865
*/
// some matrix algebra workout to obtain the mean of the metrics
mat U = J(rowsof(res),1,1)
mat sum = U'*res
/* create vector of column (variable) means */
mat mean_res = sum/rowsof(res)
// show the average of the metrics across the folds
mat li mean_res
/*
mean_res[1,4]
R2_fold MSE_fold R2_hold MSE_hold
c1 .70678088 1.0409408 .70532425 1.0481599
*/
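Since the question mentions that in Python you would just collect the test MSEs in a list and average them, here is a rough scikit-learn sketch of the strict version of the procedure (fit on the training folds only, score on the held-out fold). Plain OLS is only a stand-in for the fixed-effects xtreg model, and the simulated data simply mirrors the Stata example above:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

np.random.seed(4)
x = np.random.normal(size=(100, 2))
y = 1 + 0.5 * x[:, 0] + 1.5 * x[:, 1] + np.random.normal(size=100)

test_mses = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=4).split(x):
    model = LinearRegression().fit(x[train_idx], y[train_idx])  # fit on the training folds only
    pred = model.predict(x[test_idx])                           # predict the held-out fold
    test_mses.append(mean_squared_error(y[test_idx], pred))

print(np.mean(test_mses))  # average test MSE across the 5 folds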
I am trying to get the score of the best match using difflib.get_close_matches:
import difflib
best_match = difflib.get_close_matches(str,str_list,1)[0]
I know of the option to add the 'cutoff' parameter, but I couldn't find out how to get the actual score after setting the threshold.
Am I missing something? Is there a better solution to match unicode strings?
I found that difflib.get_close_matches is the simplest way for matching/fuzzy-matching strings. But there are a few other more advanced libraries like fuzzywuzzy as you mentioned in the comments.
But if you want to use difflib, you can use difflib.SequenceMatcher to get the score as follows:
import difflib
my_str = 'apple'
str_list = ['ape' , 'fjsdf', 'aerewtg', 'dgyow', 'paepd']
best_match = difflib.get_close_matches(my_str,str_list,1)[0]
score = difflib.SequenceMatcher(None, my_str, best_match).ratio()
In this example, the best match between 'apple' and the list is 'ape' and the score is 0.75.
You can also loop through the list and compute all the scores to check:
for word in str_list:
    print "score for: " + my_str + " vs. " + word + " = " + str(difflib.SequenceMatcher(None, my_str, word).ratio())
For this example, you get the following:
score for: apple vs. ape = 0.75
score for: apple vs. fjsdf = 0.0
score for: apple vs. aerewtg = 0.333333333333
score for: apple vs. dgyow = 0.0
score for: apple vs. paepd = 0.4
Documentation for difflib can be found here: https://docs.python.org/2/library/difflib.html
To answer the question, the usual route would be to obtain the comparative score for a match returned by get_close_matches() individually in this manner:
match_ratio = difflib.SequenceMatcher(None, 'aple', 'apple').ratio()
Here's a way that increases speed in my case by about 10% ...
I'm using get_close_matches() for spellchecking. It runs SequenceMatcher() under the hood, but normally it strips the scores and returns just a list of matching strings.
But with a small change in Lib/difflib.py, currently around line 736, the return value can be a dictionary with scores as values, so there is no need to run SequenceMatcher again on each list item to obtain its score ratio. In the examples I've shortened the output float values for clarity (e.g. 0.8888888888888888 to 0.889). The argument n=7 limits the return to the 7 highest-scoring items if there are more than 7, which matters when there are many candidates.
Current mere list return
In this example the result would normally be like ['apple', 'staple', 'able', 'lapel']
... at the default cutoff of .6 if omitted (as in Ben's answer, no judgement).
The change
in difflib.py is simple (the original return is shown in the trailing comment):
return {v: k for (k, v) in result} # hack to return dict with scores instead of list, original was ... [x for score, x in result]
New dictionary return
includes scores like {'apple': 0.889, 'staple': 0.8, 'able': 0.75, 'lapel': 0.667}
>>> to_match = 'aple'
>>> candidates = ['lapel', 'staple', 'zoo', 'able', 'apple', 'appealing']
Increasing minimum score cutoff/threshold from .4 to .8:
>>> difflib.get_close_matches(to_match, candidates, n=7, cutoff=.4)
{'apple': 0.889, 'staple': 0.8, 'able': 0.75, 'lapel': 0.667, 'appealing': 0.461}
>>> difflib.get_close_matches(to_match, candidates, n=7, cutoff=.7)
{'apple': 0.889, 'staple': 0.8, 'able': 0.75}
>>> difflib.get_close_matches(to_match, candidates, n=7, cutoff=.8)
{'apple': 0.889, 'staple': 0.8}
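If you would rather not patch the standard library, a small wrapper gives the same scores-included result; the function name here is my own, and because it skips the quick-ratio shortcuts inside get_close_matches() the exact tie ordering may differ slightly:

import difflib

def close_matches_with_scores(word, candidates, n=3, cutoff=0.6):
    # same idea as get_close_matches(), but keep the ratios
    sm = difflib.SequenceMatcher()
    sm.set_seq2(word)
    scored = []
    for cand in candidates:
        sm.set_seq1(cand)
        ratio = sm.ratio()
        if ratio >= cutoff:
            scored.append((cand, ratio))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return dict(scored[:n])

With the example above, close_matches_with_scores('aple', candidates, n=7, cutoff=.4) should return the same dictionary as the patched version, e.g. {'apple': 0.889, 'staple': 0.8, ...}.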
I'm not too confident that I am asking this question correctly, but this is what I'd like to do.
In Django admin, I would like to write an action that sorts the list of my contestants randomly and doesn't allow two people with the same first name to be within 4 records of each other. So basically,
if you have John L., John C., Carey J., Tracy M., Mary T., the records would be listed like this:
John L.
Mary T.
Carey J.
Tracy M.
John C.
OR
How can I write an action that would create random groups where two people with the same name wouldn't be within the same group like so:
John L., John C., Carey J., Tracy M., Mary T. =
Group 1
John L.
Mary T.
Carey J.
Tracy M.
Group 2
John C.
Forgive me if it isn't very clear; let me know and I'll try to specify further, but any help would be appreciated.
EDIT:
Is this what you are referring to? I can't quite figure out how to compare the fields to see if they are the same
Model:
class people(models.Model):
    fname = models.CharField()
    lname = models.CharField()
    group = models.IntegerField()
View:
N = 4
Num = randint(0, N-1)
for x in queryset:
    x.group = Num
    if group == group | fname == fname | lname == lname:
        x.group = (Num + 1) % N
Your first question cannot always be solved. Just think of the case where all the contestants have the same name; then you simply cannot find a valid ordering.
For the second question, though, I can suggest an algorithm.
Since I do not see any restriction on the number of groups, I will suggest a method that creates the fewest groups.
EDIT: I assumed you don't want 2 people with the same first name in a group.
The steps are
Count the appearance of each name
count = {}
for x in queryset:
    if x.fname not in count:
        count[x.fname] = 0
    count[x.fname] += 1
Find the name with the most appearances
N = 0
for x in queryset:
    if count[x.fname] > N:
        N = count[x.fname]
Create N groups, where N equals the number of appearances of the name found in step 2
For each name, generate a random number X, where X < N.
Try to put the name into group X. If group X already has that name, set X = (X + 1) % N and retry; repeat until success. You will always find a group for the contestant.
from random import randint

groups = [[] for _ in range(N)]  # N separate lists; [[]] * N would make N references to one list
for item in queryset:
    X = randint(0, N-1)
    while item.fname in groups[X]:
        X = (X + 1) % N
    groups[X].append(item.fname)
    item.group = X
EDIT:
Added details in steps 1, 2, 4.
From the code segment in your edit, I think you do not actually need a definition of "group" in the model, as it seems you only need a group number for each person.
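For completeness, here is a rough sketch of how the grouping above could be wired into a Django admin action; the action name is my own, and the fname/group fields come from your model:

from collections import Counter
from random import randint

def assign_random_groups(modeladmin, request, queryset):
    contestants = list(queryset)
    # steps 1-2: the most frequent first name determines the number of groups
    n_groups = max(Counter(p.fname for p in contestants).values())
    placed = [set() for _ in range(n_groups)]  # first names already used in each group
    for person in contestants:
        x = randint(0, n_groups - 1)
        while person.fname in placed[x]:       # step 5: move on until a free group is found
            x = (x + 1) % n_groups
        placed[x].add(person.fname)
        person.group = x
        person.save()

You would then add assign_random_groups to the actions list of the relevant ModelAdmin.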
Use case: extending the pivot functionality of pandas. Fetch the top n records and plot each name's own "click %" against the number of records for that name.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'name':['A', 'A', 'B', 'B','C','A'], 'click':[1,1,0,1,1,0]})
click name
0 1 A
1 1 A
2 0 B
3 1 B
4 1 C
5 0 A
[6 rows x 2 columns]
#fraction of records present & clicks as a fraction of its OWN records present
f=df1.pivot_table(rows='name', aggfunc=[len, np.sum])
f['len']['click']/sum(f['len']['click']) , f['sum']['click']/sum(f['sum']['click'])
(name
A 0.500000
B 0.333333
C 0.166667
Name: click, dtype: float64, name
A 0.50
B 0.25
C 0.25
Name: click, dtype: float64)
But to be able to plot them, I need to store the top n records in an object that matplotlib supports.
I tried storing the "top names" A, B, C, etc. by creating a dict from the output of
f['len']['click']/sum(f['len']['click'])
sorted by values, after which I stored the "click %" [A -> 0.50, B -> 0.25, C -> 0.25] in the same dictionary.
Since this is clearly overkill, I am wondering if there's a more pythonic way to do this?
I also tried head with a groupby clause, but it doesn't give me what I am looking for. I am looking for a dataframe as above
A 0.500000
B 0.333333
C 0.166667
Name: click, dtype: float64, name
A 0.50
B 0.25
C 0.25
except that the top n logic should be embedded (head(n) does not work because n depends on my data set; I guess I need to use "apply"?), and after this the object, which is a "" object, needs to be identified by matplotlib with its own labels (the top n "name" values here)
Here's my dict function implementation:
# This is overkill just to fetch the top n by a custom criterion, as above
def freq_counts(df_var, n):  # df_var is like df1.name, just to make the top n logic generic for each column name
    perct_freq = dict((df_var.value_counts() * 100) / len(df_var))
    vec = []
    for key, value in perct_freq.items():
        if value >= n:
            vec.append([key, value])
    return vec
freq_counts(df1.name,3) # eg. top 3 freq counts - to get the names, see vec[i][0] which has the corresponding keys
#In this example when I calculate the "perct_freq", which is a Series object, I would ideally want to avoid converting this to a dict - What an overkill !
Store the actual occurrences (len of names), and find the fraction of each "name" in the population
Against this, also find the "success outcome" and express it as a fraction of its OWN population
Finally, plot the top n name(s), i.e. the output of (1) & (2), in the same plot; the criterion for top n should be based on (1) as a percentage
I.e. for (1) & (2) use dataframes that support plot, with
name as labels in x axis
(1) as y axis (primary)
(2) as y axis (secondary)
PPS: In the code above -
(1) is > f['len']['click']/sum(f['len']['click']) and
(2) is > f['sum']['click']/sum(f['sum']['click'])
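Here is a sketch of the same computation using pandas built-ins instead of the dict round trip; the top_n value, the newer pandas calls (value_counts(normalize=True), sort_values, secondary_y) and the bar-plot choice are all assumptions on my part:

import pandas as pd

df1 = pd.DataFrame({'name': ['A', 'A', 'B', 'B', 'C', 'A'],
                    'click': [1, 1, 0, 1, 1, 0]})

record_share = df1['name'].value_counts(normalize=True)   # (1) share of all records per name
click_share = df1.groupby('name')['click'].sum()
click_share = click_share / click_share.sum()             # (2) share of all clicks per name

summary = pd.DataFrame({'record_share': record_share, 'click_share': click_share})
top_n = 3                                                 # assumed cut-off
top = summary.sort_values('record_share', ascending=False).head(top_n)

ax = top['record_share'].plot(kind='bar')                 # (1) on the primary y axis
top['click_share'].plot(ax=ax, secondary_y=True)          # (2) on the secondary y axis

Because record_share and click_share are plain Series indexed by name, the names are picked up as x-axis labels automatically.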
I have written code that takes a text file as input and prints only the variants which occur more than once. By variants I mean chr positions in the text file.
The input file looks like this:
chr1 1048989 1048989 A G intronic C1orf159 0.16 rs4970406
chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407
chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407
chr1 1113121 1113121 G A intronic TTLL10 0.13 rs12092254
As you can see, rows 2 and 3 repeat. I'm just taking the first 3 columns and seeing if they are the same. Here, chr1 1049083 1049083 repeats in both row 2 and row 3, so I print out that there is one duplicate, along with its position.
I have written the code below. Though it's doing what I want, it's quite slow. It takes about 5 min to run on a file which has 700,000 rows. I wanted to know if there is a way to speed things up.
Thanks!
#!/usr/bin/env python
""" takes an input file and
prints out only the variants that occur more than once """
import shlex
import collections
rows = open('variants.txt', 'r').read().split("\n")
# removing the header and storing it in a new variable
header = rows.pop()
indices = []
for row in rows:
    var = shlex.split(row)
    indices.append("_".join(var[0:3]))
dup_list = []
ind_tuple = collections.Counter(indices).items()
for x, y in ind_tuple:
    if y > 1:
        dup_list.append(x)
print dup_list
print len(dup_list)
Note: in this case the entire row 2 is a duplicate of row 3, but this is not necessarily always the case. Duplicated chr positions (the first three columns) are what I'm looking for.
EDIT:
Edited the code as per the suggestion of damienfrancois. Below is my new code:
f = open('variants.txt', 'r')
indices = {}
for line in f:
    row = line.rstrip()
    var = shlex.split(row)
    index = "_".join(var[0:3])
    if indices.has_key(index):
        indices[index] = indices[index] + 1
    else:
        indices[index] = 1

dup_pos = 0
for key, value in indices.items():
    if value > 1:
        dup_pos = dup_pos + 1
print dup_pos
I used time to see how long both versions of the code take.
My original code:
time run remove_dup.py
14428
CPU times: user 181.75 s, sys: 2.46 s, total: 184.20 s
Wall time: 209.31 s
Code after modification:
time run remove_dup2.py
14428
CPU times: user 177.99 s, sys: 2.17 s, total: 180.16 s
Wall time: 222.76 s
I don't see any significant improvement in the time.
Some suggestions:
do not read the whole file at once; read it line by line and process it on the fly; you'll save memory operations
let indices be a defaultdict and increment the value at key "_".join(var[0:3]); this saves the costly (guessing here, you should use a profiler) collections.Counter(indices).items() step (see the sketch after this list)
try pypy or a python compiler
split your data in as many subsets as your computer has cores, apply the program to each subset in parallel then merge the results
HTH
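A sketch of the first two suggestions combined (stream the file and let collections do the counting); defaultdict(int) would behave the same as the Counter used here, and the file name is taken from the question:

from collections import Counter

counts = Counter()
with open('variants.txt') as f:
    for line in f:                     # read line by line, nothing held in memory
        parts = line.split()
        if len(parts) >= 3:            # skip blank lines; a header would just count once
            counts['_'.join(parts[:3])] += 1

dup_list = [pos for pos, c in counts.items() if c > 1]
print len(dup_list)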
A big time sink is probably the if..has_key() portion of the code. In my experience, try-except is a lot faster...
f = open('variants.txt', 'r')
indices = {}
for line in f:
    var = line.split()
    index = "_".join(var[0:3])
    try:
        indices[index] += 1
    except KeyError:
        indices[index] = 1
f.close()

dup_pos = 0
for key, value in indices.items():
    if value > 1:
        dup_pos = dup_pos + 1
print dup_pos
Another option there would be to replace the four try-except lines with:
indices[index] = 1 + indices.get(index,0)
This approach only tells how many of the lines are duplicated, not how many times they are repeated. (So if one line is duped 3x, it will still count as one...)
If you are only trying to count the duplicates and not delete or note them, you could tally the lines of the file as you go and compare this to the length of the indices dictionary; the difference is the number of dupe lines (instead of looping back through and re-counting). This might save a little time, but gives a different answer:
#!/usr/bin/env python
f = open('variants.txt', 'r')
indices = {}
total_len = 0
for line in f:
    total_len += 1
    var = line.split()
    index = "_".join(var[0:3])
    indices[index] = 1 + indices.get(index, 0)
f.close()
print "Number of duplicated lines:", total_len - len(indices.keys())
I'd be curious to hear what your benchmarks are for code that does not include the has_key() test...
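If anyone wants to run that comparison, here is a rough, synthetic timeit sketch of the three dictionary-update styles discussed above; the key distribution is made up and will not match the real variants file:

import random
import timeit

random.seed(0)
keys = [str(random.randint(0, 100000)) for _ in range(700000)]

def use_has_key():
    d = {}
    for k in keys:
        if d.has_key(k):
            d[k] += 1
        else:
            d[k] = 1

def use_try_except():
    d = {}
    for k in keys:
        try:
            d[k] += 1
        except KeyError:
            d[k] = 1

def use_get():
    d = {}
    for k in keys:
        d[k] = 1 + d.get(k, 0)

for fn in (use_has_key, use_try_except, use_get):
    print fn.__name__, timeit.timeit(fn, number=3)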