Effects of Stemming on the term frequency? - data-mining

How are the term frequencies (TF), and inverse document frequency (IDF), affected by stop-word removal and stemming?
Thanks!

TF is the term frequency.
IDF is the inverse document frequency, obtained by dividing the total number of documents by the number of documents containing the term and then taking the logarithm of that quotient: IDF(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing term t.
Stemming groups all words derived from the same stem (e.g. played, play, ...). This grouping increases the count of the stem, because frequencies are then calculated per stem rather than per word.
For example, if you have 2 documents:
the first one contains 'play' 2 times and 'played' 5 times,
and the second document contains 'play' 3 times and 'played' 1 time
If you search for 'play' without stemming, the second document ranks first because it contains more occurrences of the exact word 'play'. With stemming, both words reduce to 'play', and the first document ranks first because it contains the stem 'play' 7 times versus 4 times in the second document.
Concerning stop-word removal: stop words occur frequently in almost every document and are not keywords for any of them, so they have a high frequency without carrying any real meaning; removing them keeps these uninformative terms out of the frequencies.
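To make the example concrete, here is a minimal Python sketch (the stem() function is just a stand-in for a real stemmer such as NLTK's PorterStemmer, and the two toy documents mirror the play/played example above):

from collections import Counter

# Toy stand-in for a real stemmer (e.g. nltk.stem.PorterStemmer);
# it only knows about the play/played example used above.
def stem(word):
    return "play" if word in ("play", "played") else word

doc1 = ["play"] * 2 + ["played"] * 5   # first document
doc2 = ["play"] * 3 + ["played"] * 1   # second document

# Without stemming: doc2 wins on the exact word 'play' (3 vs 2).
print(Counter(doc1)["play"], Counter(doc2)["play"])
# With stemming: doc1 wins on the stem 'play' (7 vs 4).
print(Counter(map(stem, doc1))["play"], Counter(map(stem, doc2))["play"])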


Trying to find Top 10 products within categories through Regex

I have a ton of products, separated into different categories.
I've aggregated each product's revenue within its category, and I now need to locate the top 10.
The issue is that not every product has sold within a given timeframe, and some categories don't even have 10 products, leaving me with fewer than 10 values.
As an example, these are some of the values:
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,3,3,5,6,20,46,47,53,78,92,94,111,115,139,161,163,208,278,291,412,636,638,729,755,829,2673
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,57,124,158,207,288,547
0,0,90,449,1590,10492
0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,7,12,14,32,32,37,62,64,64,64,94,100,103,109,113,114,114,129,133,148,152,154,160,167,177,188,205,207,207,209,214,214,224,225,238,238,244,247,254,268,268,285,288,298,301,305,327,333,347,348,359,362,368,373,402,410,432,452,462,462,472,482,495,511,512,532,566,597,599,600,609,620,636,639,701,704,707,728,747,768,769,773,805,833,899,937,1003,1049,1150,1160,1218,1230,1262,1327,1377,1396,1474,1532,1547,1565,1760,1768,1836,1962,1963,2137,2293,2423,2448,2451,2484,2529,2609,3138,3172,3195,3424,3700,3824,4310,4345,4415,4819,4943,5083,5123,5158,5334,5734,6673,7160,7913,9298,9349,10148,11047,11078,12929,18535,20756,28850,63447
63,126
How would you get as close as possible to capturing the top 10 within a category, and how would you ensure that only products that have actually sold are included? And all of this through regex.
My current setup only finds the top 3 and is very basic:
Step 1: ^.*\,(.*\,.*\,.*)$ finding top 3
Step 2: ^(.*)\,.*\,.*$ finding the lowest value of the top 3 products
Step 3: Checking if original revenue value is higher than, or equal to, step 2 value.
Step 4: If yes, then bestseller, otherwise just empty value.
Thanks in advance
You didn't specify a programming language, so I'm going with JavaScript here, but this regex is compatible with almost any regex flavor:
(?:[1-9]\d*,){0,9}[1-9]\d*$
(?:[1-9]\d*,){0,9} - between 0 and 9 times, find numbers followed by a comma; ignore zero revenue
[1-9]\d* - guarantee a non-zero revenue one time
$ - end line anchor
https://regex101.com/r/1xBQD3/1
If your data were to have leading zeros like 0,0,00090,00449,01590,10492 for some reason then you would need this regex which is 33% more expensive:
(?:0*[1-9]\d*,){0,9}0*[1-9]\d*$
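The answer above targets JavaScript, but since the pattern is portable, here is a minimal Python sketch of the same idea (the sample rows are made up, and each row is assumed to be sorted ascending as in the question):

import re

# Each row is one category's aggregated revenues, sorted ascending.
rows = [
    "0,0,90,449,1590,10492",
    "0",
    "63,126",
]

# Up to ten trailing non-zero, comma-separated values (zero revenues are excluded).
top10_re = re.compile(r"(?:[1-9]\d*,){0,9}[1-9]\d*$")

for row in rows:
    m = top10_re.search(row)
    top10 = m.group(0).split(",") if m else []
    print(row, "->", top10)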

converting a sentence to an embedding representation

If I have a sentence, ex: “get out of here”
And I want to use word2vec embeddings to represent it. I found three different ways to do that:
1 - for each word, compute the average of its embedding vector, so each word is replaced by a single value.
2 - as in 1, but using the standard deviation of the embedding vector values.
3 - or by adding the embedding vectors as they are, so with a 300-dimensional embedding the above example gives a final vector of length 300 * 4 words = 1200 to represent the sentence.
Which one of them is most suitable, specifically for sentence-similarity applications?
The way you describe option (1) makes it sound like each word becomes a single number. That wouldn't work.
The simple approach that's often used is to average all word-vectors for words in the sentence together - so with 300-dimensional word-vectors, you still wind up with a 300-dimensional sentence-average vector. Perhaps that's what you mean by your option (1).
(Sometimes, all vectors are normalized to unit-length before this operation, but sometimes not - because the non-normalized vector lengths can sometimes indicate the strength of a word's meaning. Sometimes, word-vectors are weighted by some other frequency-based indicator of their relative importance, such as TF/IDF.)
I've never seen your option (2) used and don't quite understand what you mean or how it could possibly work.
Your option (3) would be better described as "concatenating the word-vectors". It gives different-sized vectors depending on the number of words in the sentence. Slight differences in word placement, such as comparing "get out of here" and "of here get out", would result in very different vectors, that usual methods of comparing vectors (like cosine-similarity) would not detect as being 'close' at all. So it doesn't make sense, and I've not seen it used.
So, only your option (1), as properly implemented to (weighted-)average word-vectors, is a good baseline for sentence-similarities.
But, it's still fairly basic and there are many other ways to compare sentences using text-vectors. Here are just a few:
One algorithm closely related to word2vec itself is called 'Paragraph Vectors', and is often called Doc2Vec. It uses a very word2vec-like process to train vectors for full ranges of text (whether they're phrases, sentences, paragraphs, or documents) that work kind of like 'floating document-ID words' over the full text. It sometimes offers a benefit over just averaging word-vectors, and in some modes can produce both doc-vectors and word-vectors that are also comparable to each other.
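For example, with gensim's Doc2Vec implementation (a minimal sketch assuming a gensim 4.x-style API and a toy corpus, not a tuned setup), training and querying looks roughly like this:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus of tokenized sentences; use your own data in practice.
sentences = [["get", "out", "of", "here"],
             ["please", "leave", "this", "place"],
             ["the", "cat", "sat", "on", "the", "mat"]]
documents = [TaggedDocument(words, [i]) for i, words in enumerate(sentences)]

model = Doc2Vec(documents, vector_size=100, min_count=1, epochs=40)

# Infer a fixed-size vector for a new sentence and find the closest training docs.
vec = model.infer_vector(["get", "out", "of", "here"])
print(model.dv.most_similar([vec], topn=2))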
If your interest isn't just pairwise sentence similarities, but some sort of downstream classification task, then Facebook's 'FastText' refinement of word2vec has a classification mode, where the word-vectors are trained not just to predict neighboring words, but to be good at predicting known text classes, when simply added/averaged together. (Text-vectors constructed from such classification vectors might be good at similarities too, depending on how well the training-classes capture salient contrasts between texts.)
Another way to compute pairwise similarities, using just word-vectors, is "Word Mover's Distance". Rather than averaging all the word-vectors for a text together into a single text-vector, it considers each word-vector as a sort of "pile of meaning". Compared to another sentence, it calculates the minimum routing work (distance along lots of potential word-to-word paths) to move all the "piles" from one sentence into the configuration of another sentence. It can be expensive to calculate, but usually represents sentence-contrasts better than the simple single-vector-summary that naive word-vector averaging achieves.
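A minimal gensim sketch of Word Mover's Distance (assuming a gensim 4.x-style API; wmdistance additionally needs the POT package installed, and the toy corpus is far too small to give meaningful vectors, it only shows the call):

from gensim.models import Word2Vec

sentences = [["get", "out", "of", "here"],
             ["please", "leave", "this", "place"],
             ["the", "cat", "sat", "on", "the", "mat"]]
model = Word2Vec(sentences, vector_size=100, min_count=1)

# Lower distance means more similar sentences.
print(model.wv.wmdistance(["get", "out", "of", "here"],
                          ["please", "leave", "this", "place"]))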
import numpy as np
from gensim.models import Word2Vec

# `sentences` is assumed to be a list of tokenized sentences (lists of words).
model = Word2Vec(sentences, vector_size=100, min_count=1)

def sent_vectorizer(sent, model):
    # Average the word-vectors of all in-vocabulary words in the sentence
    # (option 1 above, unweighted).
    sent_vec = []
    numw = 0
    for w in sent:
        try:
            if numw == 0:
                sent_vec = model.wv[w]          # model[w] no longer works in gensim 4.x
            else:
                sent_vec = np.add(sent_vec, model.wv[w])
            numw += 1
        except KeyError:
            pass                                # skip out-of-vocabulary words
    return np.asarray(sent_vec) / numw

X = []
for sentence in sentences:
    X.append(sent_vectorizer(sentence, model))

print("========================")
print(X)

String Finding Alg w/ Lowest Freq Char

I have 3 text files. One with a set of text to be searched through
(ex. ABCDEAABBCCDDAABC)
One contains a number of patterns to search for in the text
(ex. AB, EA, CC)
And the last containing the frequency of each character
(ex.
A 4
B 4
C 4
D 3
E 1
)
I am trying to write an algorithm that finds the least frequently occurring character for each pattern, searches the text for occurrences of that character, and then checks the surrounding letters to see if the string is a match. Currently, I have the characters and frequencies in their own vectors (where index i=0 of each vector corresponds to A and 4, respectively).
Is there a better way to do this? Maybe a faster data structure? Also, what are some efficient ways to check the pattern string against the piece of the text string once the least frequent letter is found?
You can run the Aho-Corasick algorithm. Once the preprocessing (whose cost is independent of the text) is done, its complexity is Θ(n + p), where
n is the length of the text
p is the total number of matches found
This is essentially optimal. There is no point in trying to skip over letters that appear to be frequent:
If the letter is not part of a match, the algorithm takes unit time.
If the letter is part of a match, then the match includes all letters, irrespective of their frequency in the text.
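If you don't want to implement it yourself, a minimal Python sketch using the pyahocorasick package (my assumption, not something from the question) looks like this, with the text and patterns from the example above:

import ahocorasick   # pip install pyahocorasick

text = "ABCDEAABBCCDDAABC"
patterns = ["AB", "EA", "CC"]

automaton = ahocorasick.Automaton()
for pattern in patterns:
    automaton.add_word(pattern, pattern)
automaton.make_automaton()

# iter() reports every occurrence of every pattern in one left-to-right pass.
for end_index, pattern in automaton.iter(text):
    start_index = end_index - len(pattern) + 1
    print(pattern, "found at", start_index)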
You could run an iteration loop that keeps a count of instances and checks whether a character has appeared more than a certain percentage of the time, based on the number of characters searched for and the total length of the string. For example, if you have 100 characters and 5 possibilities, any character that has appeared in more than 20% of the hundred can be discounted, increasing efficiency by skipping any value matching that one.

Loop Feature Matching

Hello, I have to implement feature stereo matching for egomotion estimation.
From the paper "Multispectral Stereo Odometry":
"The feature in the right image that maximizes the similarity
function for a given feature in the left image is selected as a
potential match. A threshold is then applied to keep only strong
matches. As stated above, the algorithm is fed with four images:
previous left (imLt−1), previous right (imRt−1), current left
(imLt), and current right (imRt). The matching is carried out
in a loop fashion [14] to keep only features that find their
correspondences across all four images. Fig. 4 illustrates the
different steps. We first start by finding stereo matches between
(imLt−1) and (imRt−1) (I). Then, sequential matches are found
between (imRt−1) and (imRt) (II). Another stereo matching
is performed between (imLt) and (imRt) (III). Finally, a
last sequential matching is performed between (imLt−1) and
(imLt) (IV). At this stage, if the starting and ending feature
points are identical, then the match is accepted. Otherwise, it is
simply rejected. This process is carried out for all the features
extracted in the first image (imLt−1)."
My question is: what does "identical" mean when it refers to the first and last feature?
What does "a threshold is then applied" mean?
illustration of the loop matching steps
This is what I understood from the extract you posted:
Thresholding: I would say that the matching process is done first by comparing potential matches and computing their similarity, then by finding the match with the highest similarity. Once you've found it, you should compare that similarity against a pre-defined threshold value T. If the match similarity is below the threshold, then you discard the match. In order to detect the best threshold T, I'd try some values and see what happens.
Identical match: From what I understood, the authors perform the matching process in a loop: starting from a point P in imL(t-1), they perform a stereo matching process towards imR(t-1), then a sequential matching between imR(t-1) and imR(t), then a stereo matching between imR(t) and imL(t) and a final sequential match between imL(t) and imL(t-1), obtaining a new point Q. If P and Q are the same point (in terms of spatial coordinates probably), then the loop matching process is considered successful.
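As a rough illustration (a minimal sketch of my reading of the loop check, not the paper's actual code), with each matching stage represented as a dictionary from feature index to feature index:

def loop_match(stereo_prev, temporal_right, stereo_curr, temporal_left):
    """Keep only features of imL(t-1) whose matches close the loop.

    stereo_prev:    imL(t-1) -> imR(t-1)   (step I)
    temporal_right: imR(t-1) -> imR(t)     (step II)
    stereo_curr:    imR(t)   -> imL(t)     (step III)
    temporal_left:  imL(t)   -> imL(t-1)   (step IV)
    """
    accepted = []
    for p in stereo_prev:
        q = temporal_right.get(stereo_prev[p])
        if q is None:
            continue
        q = stereo_curr.get(q)
        if q is None:
            continue
        q = temporal_left.get(q)
        if q == p:            # start and end feature are identical: accept
            accepted.append(p)
    return accepted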
EDIT: Could you add the title of the paper please?

Calculating tf-idf among documents using python 2.7

I have a scenario where I have retrieved information/raw data from the internet and placed it into respective .json or .txt files.
From there I would like to calculate the frequencies of each term in each document, and their cosine similarity, using tf-idf.
For example:
there are 50 different documents/text files consisting of 5000 words/strings each
I would like to take the first word from the first document/text, compare it against all 250,000 words in total and find its frequencies, then do the same for the second word, and so on for all 50 documents/texts.
The expected output of each frequency will be from 0 to 1.
How am I able to do so? I have been referring to the sklearn package, but most of the examples only involve a few strings in each comparison.
You really should show us your code and explain in more detail which part it is that you are having trouble with.
What you describe is not usually how it's done. What you usually do is vectorize documents, then compare the vectors, which yields the similarity between any two documents under this model. Since you are asking about NLTK, I will proceed on the assumption that you want this regular, traditional method.
Anyway, with a traditional word representation, cosine similarity between two words is meaningless -- either two words are identical, or they're not. But there are certainly other ways you could approach term similarity or document similarity.
Copying the code from https://stackoverflow.com/a/23796566/874188 so we have a baseline:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is very strange",
          "This is very nice"]

vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)    # one tf-idf row vector per document

# vectorizer.idf_ is the public equivalent of the linked answer's vectorizer._tfidf.idf_
idf = vectorizer.idf_
print dict(zip(vectorizer.get_feature_names(), idf))
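Since the question also asks about cosine similarity, the snippet above can be extended like this (X is the document-term matrix just computed; this is only a sketch):

from sklearn.metrics.pairwise import cosine_similarity

# One row per document; entry [i][j] is the cosine similarity of documents i and j.
print cosine_similarity(X)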
There is nothing here which depends on the length of the input. The number of features in idf will be larger if you have longer documents and there will be more of them in the corpus if you have more documents, but the algorithm as such will not need to change at all to accommodate more or longer documents.
If you don't want to understand why, you can stop reading here.
The vectors are basically an array of counts for each word form. The length of each vector is the number of word forms (i.e. the number of features). So if you have a lexicon with six entries like this:
0: a
1: aardvark
2: banana
3: fruit
4: flies
5: like
then the input document "a fruit flies like a banana" will yield a vector of six elements like this:
[2, 0, 1, 1, 1, 1]
because there are two occurrences of the word at index zero in the lexicon, zero occurrences of the word at index one, one of the one at index two, etc. This is a TF (term frequency) vector. It is already a useful vector; you can compare two of them using cosine distance, and obtain a measurement of their similarity.
The purpose of the IDF factor is to normalize this. The normalization brings three benefits. Computationally, you don't need to do any per-document or per-comparison normalization, so it's faster. The algorithm also down-weights frequent words, so that many occurrences of "a" are properly regarded as insignificant if most documents contain many occurrences of this word (so you don't have to do explicit stop-word filtering), whereas many occurrences of "aardvark" are immediately, obviously significant in the normalized vector. Finally, the normalized output can be readily interpreted, whereas with plain TF vectors you would have to take document length etc. into account to properly understand the result of the cosine similarity comparison.
So if the DF (document frequency) of "a" is 1000, and the DF of the other words in the lexicon is 1, the scaled vector will be
[0.002, 0, 1, 1, 1, 1]
(because we take the inverse of the document frequency, i.e. TF("a")*IDF("a") = TF("a")/DF("a") = 2/1000).
The cosine similarity basically interprets these vectors in an n-dimensional space (here, n=6) and sees how far from each other their arrows are. Just for simplicity, let's scale this down to three dimensions, and plot the (IDF-scaled) number of "a" on the X axis, the number of "aardvark" occurrences on the Y axis, and the number of "banana" occurrences on the Z axis. The end point [0.002, 0, 1] differs from [0.003, 0, 1] by just a tiny bit, whereas [0, 1, 0] ends up at quite another corner of the cube we are imagining, so the cosine distance is large. (The normalization means 1.0 is the maximum of any element, so we are talking literally a corner.)
Now, returning to the lexicon, if you add a new document and it has words which are not already in the lexicon, they will be added to the lexicon, and so the vectors will need to be longer from now on. (Vectors you already created which are now too short can be trivially extended; the term weight for the hitherto unseen terms will obviously always be zero.) If you add the document to the corpus, there will be one more vector in the corpus to compare against. But the algorithm doesn't need to change; it will always create vectors with one element per lexicon entry, and you can continue to compare these vectors using the same methods as before.
You can of course loop over the terms and for each, synthesize a "document" consisting of just that single term. Comparing it to other single-term "documents" will yield 0.0 similarity to the others (or 1.0 similarity to a document containing the same term and nothing else), so that's not too useful, but a comparison against real-world documents will reveal essentially what proportion of each document consists of the term you are examining.
The raw IDF vector tells you the relative frequency of each term. It usually expresses how many documents each term occurred in (so even if a term occurs more than once in a document, it only adds 1 to the DF for this term), though some implementations also allow you to use the bare term count.