How can I find the relevance of a page? - data-mining

Say I have some features of a person and then I perform a Google search.
A list of results is displayed. I want to mine those results and check how relevant their contents are to that particular person.

If you need to perform a Google search, the relevancy is best computed using
relevancy = cos(x) * (sqrt(x) + atan2(x,y)) * sin(0)
This is the universal formula for relevancy. You can apply it to people or to questions on SO.

Related

Vector embeddings to mimic a ranking algorithm

Consider a search system where the user submits a query ‘query’ and retrieves products based on some ranking algorithm. Assume that these products are ordered according to their quality as p_0, p_1, …, p_10 and so on.
I would like to generate vector embeddings that mimic this ranking algorithm. The closest product vector to a query vector should ideally be p_0, the next one should be p_1 and so on.
I have tried building word2vec embeddings for products by feeding products that have appeared in the same search session as sentences. Then I calculated each query vector as a weighted average of product vectors, so that the query vector is closer to the top result. Although the closest result is usually the best result for a given query, the subsequent results include some results that would never appear as a top result.
Is there a trick by which word2vec can learn the ranking algorithm, or are there other techniques that I can try? I have looked into multi-dimensional scaling with non-metric distances, but it did not seem scalable to me for more than hundreds of thousands of products.
There's no one trick – just iteratively improving your representations, & training set, & ranking methods to better meet your goals.
Word2vec-based representations can often help, but are still fairly simple & centered on individual words – whose senses may vary based on context & position in ways that a simple weighted-average-of-tokens fails to capture.
You may want to represent 'products' by more than just a string-of-word-tokens – to include other properties, as well. These could be scalar values like prices or various other kinds of ratings/properties, or extra synthetic labels, such as the result of other salient groupings (whether hand-edited or learned).
And even if just working with natural-language product descriptions – like product names, or descriptions, or reviews – there are other more-sophisticated text-representations that can be trained or used – such as sentence/document embeddings using deeper-networks than plain word2vec.
Most generically, if you have a bunch of quantitative representations of candidate results, and a query, and want to use some initial examples of "good" results to bootstrap more generalizable rules for scoring top results, you are attempting a "learning-to-rank" process:
https://en.wikipedia.org/wiki/Learning_to_rank
To suggest more specific steps would require a more specific description of inputs/outputs/goals, & what's been tried, and how what's been tried has failed.
For example, are your queries always just textual product names? In such a case, maybe plain keyword search is the central technology required – with things like word-vector-modelling just a tweak for handling some tough cases, like expanding the results, or adding more contrast to the rankings, when results are too few or too many.
Or, can you detect key gaps in the modeling related to exactly those cases where "results include some results that would [ideally] never appear as a top result"? If certain things like rare (poorly-modeled) words, or important qualities not yet captured in the model, seem to be to blame for such cases, that will guide the potential set of corrective changes.
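To make the "learning-to-rank" idea a bit more concrete, here is a minimal sketch of a pairwise approach, assuming each product can be described by a small numeric feature vector. The function names, the toy features and the hyperparameters are all made up for illustration; a real system would more likely use an established learning-to-rank library.

```python
import random

def score(w, x):
    """Linear relevance score of feature vector x under weights w."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_pairwise_ranker(ranked_lists, n_features, epochs=50, lr=0.1, seed=0):
    """Learn weights so that higher-ranked items receive higher scores.

    ranked_lists: list of result lists; each list holds feature vectors
    ordered best-first (hypothetical training data).
    """
    rng = random.Random(seed)
    w = [0.0] * n_features
    # Every (better, worse) pair within a list is one training example.
    pairs = [(items[i], items[j])
             for items in ranked_lists
             for i in range(len(items))
             for j in range(i + 1, len(items))]
    for _ in range(epochs):
        rng.shuffle(pairs)
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            # Hinge-style update: only adjust when the pair is not yet
            # separated by a margin of at least 1.
            if score(w, diff) < 1.0:
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w

# Toy data: two queries, three products each, listed best-first.
ranked_lists = [
    [[3.0, 0.2], [2.0, 0.9], [1.0, 0.5]],
    [[5.0, 0.1], [4.0, 0.8], [0.5, 0.3]],
]
weights = train_pairwise_ranker(ranked_lists, n_features=2)
```

Sorting unseen candidates by score(weights, features) then gives a ranking that tries to imitate the training order. The same pairwise idea could in principle be applied on top of embedding-derived features.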

What are hp.Discrete and hp.Realinterval? Can I include more values in hp.realinterval instead of just 2?

I am doing hyperparameter tuning with the HParams Dashboard in TensorFlow 2.0-beta0, as suggested here: https://www.tensorflow.org/tensorboard/r2/hyperparameter_tuning_with_hparams
I am confused by step 1 and could not find a better explanation. My questions relate to the following lines:
HP_NUM_UNITS = hp.HParam('num_units', hp.Discrete([16, 32]))
HP_DROPOUT = hp.HParam('dropout', hp.RealInterval(0.1, 0.2))
HP_OPTIMIZER = hp.HParam('optimizer', hp.Discrete(['adam', 'sgd']))
My question:
I want to try more dropout values instead of just two (0.1 and 0.2). If I pass more values, it throws an error: 'maximum 2 arguments can be given'. I tried to look for documentation but could not find anything about where these hp.Discrete and hp.RealInterval functions come from.
Any help would be appreciated. Thank you!
Good question. The notebook tutorial is lacking in many respects. At any rate, here is how you do it at a certain resolution res:
for dropout_rate in tf.linspace(
    HP_DROPOUT.domain.min_value,
    HP_DROPOUT.domain.max_value,
    res,
):
    pass  # run one trial with this dropout_rate here
Looking at the implementation, it doesn't really seem to me to be grid search but Monte Carlo/random search (note: this is not 100% correct, please see my edit below).
So on every iteration a random float from that real interval is chosen.
If you want GridSearch behavior just use "Discrete". That way you can even mix and match GridSearch with Random search, pretty cool!
Edit: 27th of July '22: (based on the comment of #dpoiesz)
Just to make it a little clearer: since values are sampled from the intervals, concrete values are returned. Those are then added to the grid dimension, and grid search is performed using them.
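If you want grid-like coverage of the dropout range, one option is to generate the values yourself as plain Python floats. This is only a sketch: grid_values is a made-up helper, not part of the hparams API, and the rounding to 6 decimals is an arbitrary choice to keep the values readable.

```python
def grid_values(min_value, max_value, res):
    """Return `res` evenly spaced plain Python floats in [min_value, max_value]."""
    if res < 2:
        return [float(min_value)]
    step = (max_value - min_value) / (res - 1)
    # Round to suppress floating-point noise such as 0.15000000000000002.
    return [round(min_value + i * step, 6) for i in range(res)]

# Five dropout rates between the bounds used in the question.
dropout_values = grid_values(0.1, 0.2, 5)

# Assumption: a list like this could be passed to hp.Discrete(dropout_values)
# in place of hp.RealInterval(0.1, 0.2) to get grid-search behaviour.
```

Because the values are ordinary floats rather than tensors, they can be looped over and logged like the other discrete hyperparameters in the tutorial.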
RealInterval is a (min, max) pair defining the interval from which the hparam picks a number.
Here is a link to the implementation for better understanding.
The thing is that, as currently implemented, there does not seem to be any difference between the two unless you call the sample_uniform method.
Note that tf.linspace breaks the mentioned sample code when saving current value.
See https://github.com/tensorflow/tensorboard/issues/2348
In particular OscarVanL's comment about his quick&dirty workaround.

Similarity of a group of text documents

I am looking for an algorithm that tries to check
1) the similarity of sentences (around 5000) with each other in a document
2) the similarity of multiple documents (around 5000) with respect to each other
I need this because I'm trying to evaluate whether the text documents/sentences that fall under a particular category are in any way similar to each other. Are there any existing methods for doing this?
The standard approach is to use cosine similarity, with TF-IDF normalization.
There are many variants of this; you will need to experiment to find what works best for you.
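As a rough illustration of that approach, here is a small standard-library sketch. The whitespace tokenisation, the smoothed IDF formula and the sample documents are assumptions chosen for brevity, and they are exactly the kind of "variants" you would want to experiment with.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn raw strings into sparse TF-IDF vectors (dicts of term -> weight)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF, so terms present in every document keep a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: f * idf[t] for t, f in Counter(tokens).items()}
            for tokens in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock markets fell sharply today",
]
vecs = tfidf_vectors(docs)
# Full pairwise similarity matrix; entry [i][j] compares document i with j.
sim_matrix = [[cosine(a, b) for b in vecs] for a in vecs]
```

The same functions work whether each "document" is a sentence or a whole file. Averaging the off-diagonal entries within a category gives one possible measure of how similar its members are.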

Get goal conversion rate by Time on Page by Page group?

I want to answer this question:
Does the average time on page A (or more accurately page group A) affect the conversion rate of goal B?
So far in the GUI I have:
A) Created an advanced segment of Time on Page >= 120 ("per hit" option):
http://grab.by/tKOA
B) Modified the segment to also add a filter for Page = regex matching my group:
http://grab.by/tKOU
...But I don't know if this gives me the results I'm after; that is, whether they are accurate.
I have some other ideas, including assigning the page group as a funnel step and then segmenting by the Time on Page; I'm still waiting on data to come in for that one.
I'd like to know if there is a better solution or if I'm on the right track.
Drewdavid,
Your approach is quite smart and correct, I would say. However, keep in mind that in this context you are mixing different scopes:
Time on Page is a page-level metric
Page seen is a visit-level dimension
What you would get in your report is the average time on page calculated from all the pages that were seen during visits which met the regex condition set in the filter (that's what a segment does: it includes all the pages, not just those that you want to filter). I know this can be confusing, but see this article that gives more examples and goes into greater detail.
To achieve what you are after, remove the segment filter and simply use the advanced filtering above the report table (and choose exactly the same regex you mentioned in your question).
Hope this helps!

What is the best data mining method for vehicle search?

I'm trying to build a search engine that goes through online vehicle classifieds such as Oodle, eBay Motors, and craigslist. I also have a large database of standard vehicle names and their specifications. What I would like to do, for each record that I find on a classified site, is determine exactly which vehicle model and style it is (from my database). For example, a standard name for a Ford truck in my db is:
2003 Ford F150.
However, on classified sites, people might refer to it as: "2003 Ford F 150" or "2003 Ford f-150" or "03 Ford truck 150". Is there an effective data mining/text classification algorithm to be able to normalize these texts to the standard name above?
You could use the Levenshtein distance to match the found string against your database records.
Another (probably better) idea is to tokenize the strings and use a term vector model for the vehicle names. This way you can use cosine similarity to find relevant matches.
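A minimal sketch of the Levenshtein idea, assuming you first strip case, spaces and punctuation so that variants like "F 150" and "f-150" collapse to the same key. The normalize and best_match helpers and the three sample names are made up for illustration, and the linear scan over the database would need an index for real data sizes.

```python
import re

def normalize(text):
    """Lowercase and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def best_match(listing, standard_names):
    """Return the standard name with the smallest edit distance to the listing."""
    key = normalize(listing)
    return min(standard_names,
               key=lambda name: levenshtein(key, normalize(name)))

# Hypothetical excerpt of the standard-name database.
standard_names = ["2003 Ford F150", "2003 Ford Focus", "2003 Honda Civic"]
match = best_match("2003 Ford f-150", standard_names)
```

In practice you would probably also want a distance threshold, so that listings with no reasonable counterpart are rejected rather than forced onto the nearest name.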
If you're going to develop a whole search engine intended to scale in both usage and size, you will need something robust to support your queries.
If you're going to use edit distance, Bed-trees provide a good alternative for your index structure. Another good approach, depending on the size of your dataset, is to use a Levenshtein automaton. Levenshtein automata are also great at providing auto-complete functionality, which you may need since you're developing a search engine.
An alternative to edit distance is to use n-grams combined with the Jaccard index. For this approach you can use MinHash + LSH. Also, you can use Jaccard as a distance metric (1 - Jaccard index), which respects the triangle inequality and thus can be used in a metric tree such as a VP-tree.
One of these approaches will certainly help you.
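For the n-gram variant, a bare-bones sketch of the exact Jaccard distance over character bigrams might look like the following. It computes the sets directly and does not implement MinHash, LSH or a VP-tree, which would only be needed to scale the lookup; the helper names and the choice of n=2 are assumptions.

```python
def char_ngrams(text, n=2):
    """Set of character n-grams of the text, ignoring case and punctuation."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum())
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}

def jaccard_distance(a, b, n=2):
    """1 - |A & B| / |A | B| over the n-gram sets of the two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0  # two empty strings are treated as identical
    return 1.0 - len(grams_a & grams_b) / len(union)

standard = "2003 Ford F150"
d_same = jaccard_distance(standard, "2003 Ford f-150")
d_close = jaccard_distance(standard, "03 Ford truck 150")
d_far = jaccard_distance(standard, "2003 Honda Civic")
```

Here the noisy Ford listing ends up closer to the standard name than the unrelated Honda entry, which is the ordering you want before putting an approximate index on top.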