How to assign more weight to bigram and trigram? - data-mining

I have to match the titles of two research papers using n-grams (unigrams, bigrams and trigrams only).
My supervisor has asked that, when matching, I assign more weight to the bigram match score than to the unigram match score, and more weight to the trigram match score than to the bigram match score.
For example, if two bigrams match between the titles, the score is 2,
and if two trigrams match, the score is also 2.
I need to find some values to multiply these scores by, so that the trigram score is increased and the bigram score is decreased relative to it.
I looked for research papers related to this problem but couldn't find any help there. :(
Can anyone suggest an idea, or a link to a document that might solve this?

In interpolation, we always mix the probability estimates from all the N-gram estimators, weighting and combining the trigram, bigram, and unigram counts.
In simple linear interpolation, we combine different-order N-grams by linearly interpolating all the models. Thus, we estimate the trigram probability $P(w_n \mid w_{n-2} w_{n-1})$ by mixing together the unigram, bigram, and trigram probabilities, each weighted by a $\lambda$:
$\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2} w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)$
such that the $\lambda$s sum to 1:
$\sum_i \lambda_i = 1$
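The same idea of per-order weights that sum to 1 can be applied to match counts instead of probabilities. The following is only a minimal sketch: the function names and the weights (0.2, 0.3, 0.5) are illustrative assumptions, not values taken from any paper, and tokenization is a plain lowercase whitespace split.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def weighted_ngram_score(title_a, title_b, weights=(0.2, 0.3, 0.5)):
    """Sum of shared n-gram counts, weighted per order.

    weights = (unigram, bigram, trigram). The defaults are arbitrary
    placeholders chosen so that trigram > bigram > unigram and the
    three weights sum to 1.
    """
    a = title_a.lower().split()
    b = title_b.lower().split()
    score = 0.0
    for n, w in zip((1, 2, 3), weights):
        # Count distinct n-grams that occur in both titles.
        shared = set(ngrams(a, n)) & set(ngrams(b, n))
        score += w * len(shared)
    return score


score = weighted_ngram_score("deep learning for text", "deep learning for images")
```

With these weights a single matched trigram contributes more than a single matched bigram, which is the ordering the supervisor asked for. The actual values would need to be tuned on titles with known matches.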


Weighted Average Calculations across various combinations using Cube.js

We have a question about designing a schema and handling an analytics requirement for our product, and would appreciate your advice. We are just getting started with Cube.js. Here is our requirement: our data (simplified to an example) has multiple attribute columns plus one "value" column and one "weight" column. We need to calculate weighted averages of the value/weight columns across all combinations of the attribute columns.
e.g. group by column 1 and take the weighted average (value/weight columns),
or group by columns 1 and 2 and take the weighted average, and so on.
There can be many such combinations, and we have at least 8 to 12 attribute columns.
How would we best model this?
It will probably be most convenient to create one cube with several predefined segments; alternatively, you can create a separate cube for each attribute.
It depends on your data.
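Whichever modeling option is chosen, the arithmetic is the same for every grouping: sum `value * weight` and sum `weight` per group, then divide. The sketch below shows that calculation in plain Python. It is not Cube.js schema code, and the column names `value` and `weight` and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict


def weighted_averages(rows, group_cols, value="value", weight="weight"):
    """Weighted average of `value` per distinct combination of group_cols."""
    sums = defaultdict(lambda: [0.0, 0.0])  # key -> [sum(value*weight), sum(weight)]
    for row in rows:
        key = tuple(row[c] for c in group_cols)
        sums[key][0] += row[value] * row[weight]
        sums[key][1] += row[weight]
    # Groups whose total weight is zero are skipped to avoid dividing by zero.
    return {k: vw / w for k, (vw, w) in sums.items() if w}


rows = [
    {"a": "x", "b": "p", "value": 10, "weight": 1},
    {"a": "x", "b": "q", "value": 20, "weight": 3},
    {"a": "y", "b": "p", "value": 5, "weight": 2},
]
by_a = weighted_averages(rows, ["a"])
by_a_and_b = weighted_averages(rows, ["a", "b"])
```

Because both sums are additive, the same two aggregates serve any combination of grouping columns; only the final division has to happen after grouping.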

Principal component analysis on proportional data

Is it valid to run a PCA on data that is comprised of proportions? For example, I have data on the proportion of various food items in the diet of different species. Can I run a PCA on this type of data or should I transform the data or do something else beforehand?
I had a similar question. You should search for "compositional data analysis". There are transformations you can apply to proportions in order to analyze them with multivariate techniques such as PCA. You can also find "robust" PCA algorithms to run your analysis in R. Let us know if you find an appropriate solution to your specific problem.
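One transformation commonly used in compositional data analysis is the centered log-ratio (CLR). The answer above does not name a specific transformation, so treat this as one possible choice rather than the recommended one. The sketch uses only the standard library; the `pseudocount` parameter is a hypothetical workaround for zero proportions, whose logarithm is otherwise undefined.

```python
import math


def clr(proportions, pseudocount=1e-6):
    """Centered log-ratio transform of one composition (one row of data)."""
    # Shift away from zero, then renormalize so the parts sum to 1 again.
    shifted = [p + pseudocount for p in proportions]
    total = sum(shifted)
    closed = [p / total for p in shifted]
    logs = [math.log(p) for p in closed]
    # Subtracting the mean log is the same as dividing by the geometric mean.
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


diet = [0.6, 0.3, 0.1]  # made-up proportions of three food items
transformed = clr(diet)
```

The transformed values are unbounded real numbers that sum to zero for each row, so the matrix of transformed rows can be passed to an ordinary PCA routine.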
I don't think so.
PCA will give you "impossible" answers. You might get principal components with values that proportions can't have, like negative values or values greater than 1. How would you interpret this component?
In technical terms, the support of your data is a subset of the support of PCA. Say you have $k$ classes. Then:
the support for PCA vectors is $\mathbb{R}^k$
the support for your proportion vectors is the $k$-dimensional simplex. By simplex I mean the set of vectors $p$ of length $k$ such that:
$0 \le p_i \le 1$ where $i = 1, \ldots, k$
$\sum_{i=1}^k p_i = 1$
One way around this is to find a one-to-one mapping between the $k$-simplex and all of $\mathbb{R}^k$. If one exists, you could map your proportions to $\mathbb{R}^k$, do PCA there, then map the PCA vectors back to the simplex.
But I'm not sure the simplex is a self-contained linear space. If you add two elements of the simplex, you don't get an element of the simplex :/
A better approach, I think, is clustering, e.g. with Gaussian mixtures or spectral clustering. This is related to PCA. But a nice property of clustering is that you can express any element of your data as a "convex combination" of the clusters. If you analyze your proportion data and find clusters, they (unlike PCA vectors) will lie within the simplex, and so will any mixture of them.
I also recommend looking into nonnegative matrix factorization (NMF). This is like PCA but, as the name suggests, avoids negative components and also negative eigenvectors. It's very useful for inferring structure in strictly positive data, like proportions. But NMF does not give you a basis for the simplex space.

Finding the Distance Between Two Lines that represent GPS routes (MATLAB, Java, C++, or Python)

I have been researching and trying to figure this one out to no avail. I have found many ways not to solve this...
The gist of the problem: I am looking for a method to calculate the deviation from an original path traveled, using GPS coordinates. I have multiple CSV files that contain latitude, longitude, and UTC time. I have created KML files from this information to view the deviation visually, and now I would like to put a value on it. I have chosen one route as a reference and would like to measure the other routes against it. There are multiple routes, each having its own reference route, and each of which has many runs. No two runs are the same, and some of the routes deviate more than others. I cannot use time, only lat and lon, since the runs were completed over many weeks of data collection.
What I have tried thus far:
Haversine and Equirectangular formulas (looping through and measuring point to point).
Outcome: The coordinates only line up for a short period of time and the difference in the number of points varies greatly.
Area under each curve: was going to find the difference of the two routes by this method.
Outcome: Really unsure how to proceed, nor find equations suitable for this calculation.
There were a couple more feeble attempts, but I have been working on this for a few weeks now with not much to show for it, and I am still unsure how to proceed.
Any help or ideas would be greatly appreciated.
Possible solution 1: Instead of calculating the "sideways" deviation between the two routes, just compare the respective arc lengths (Matlab: arclength).
Possible solution 2: To compare two routes, each going from the same start A to the same end point B: Draw a straight line between A and B, place a number of equidistant points along AB, and then average the perpendicular distance from these points on AB to the paths you want to compare. The absolute difference between the cumulative deviations from the straight-line reference is your deviation.
Possible solution 3: Calculate the arc length of each route. Place the same number of equidistant points along each route. Average the distance between corresponding points on the two routes.
Both solution 2 and 3 will depend on the number of points you place, but with a higher number of points, the average deviation will converge. Note that these solutions are both related to calculating the area under each curve.
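Solution 3 can be sketched in Python as follows. This is only an illustration under some assumptions: routes are lists of `(lat, lon)` tuples in degrees with at least two points each, distances use the haversine formula with a mean Earth radius of 6,371 km, and points between samples are interpolated linearly in lat/lon, which is reasonable only for short segments. The function names are made up for this example.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, an approximation


def haversine(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def resample(route, n):
    """Return n points spaced evenly by arc length along the route (n >= 2)."""
    cum = [0.0]  # cumulative arc length at each original point
    for p, q in zip(route, route[1:]):
        cum.append(cum[-1] + haversine(p, q))
    total = cum[-1]
    out = []
    seg = 0
    for i in range(n):
        target = total * i / (n - 1)
        # Advance to the segment that contains the target arc length.
        while seg < len(route) - 2 and cum[seg + 1] < target:
            seg += 1
        span = cum[seg + 1] - cum[seg]
        t = 0.0 if span == 0 else (target - cum[seg]) / span
        (lat1, lon1), (lat2, lon2) = route[seg], route[seg + 1]
        out.append((lat1 + t * (lat2 - lat1), lon1 + t * (lon2 - lon1)))
    return out


def mean_deviation(reference, run, n=200):
    """Average distance in metres between corresponding resampled points."""
    a = resample(reference, n)
    b = resample(run, n)
    return sum(haversine(p, q) for p, q in zip(a, b)) / n


reference = [(0.0, 0.0), (0.0, 0.01)]
run = [(0.001, 0.0), (0.001, 0.01)]
deviation_m = mean_deviation(reference, run)
```

Because both routes are resampled to the same number of points, the differing number of GPS fixes per run no longer matters. Increasing `n` should make the average converge, as noted above.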

Given a set of points, find smallest subset of points from which circles of n diameter can be drawn to encompass all points

I've got a list of places with associated lat/lon data (sites). I'm trying to find the fewest bases from which to visit the sites (minimizing travel occurrences). Any ideas? I've mostly been working with Python (2.7.3), but any suggestions/examples are welcome.
This can be viewed as the set cover problem.
Using Wikipedia's terminology, your universe will be the cities. If there are m cities, there will be m sets. The k-th set corresponds to the k-th city and includes all cities within the required travel radius of k, including k itself. The task is to find the smallest number of sets that cover the universe (put another way, the smallest number of cities from which you can reach every city in your universe).
The bad news is that the problem is NP-hard. There are, however, heuristics.
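One well-known heuristic is the greedy one: repeatedly pick the site whose set covers the most still-uncovered sites. The sketch below assumes sites are hashable (for example coordinate tuples) and that the caller supplies a distance function `dist`; both names, and the planar sample data, are illustrative. The result is a valid cover but is not guaranteed to be the smallest one.

```python
import math


def greedy_bases(sites, radius, dist):
    """Greedy set cover: choose bases until every site is within radius of one."""
    # cover[s] is the set of sites reachable from s, including s itself.
    cover = {s: {t for t in sites if dist(s, t) <= radius} for s in sites}
    uncovered = set(sites)
    bases = []
    while uncovered:
        # Pick the candidate that covers the most sites not yet covered.
        best = max(sites, key=lambda s: len(cover[s] & uncovered))
        bases.append(best)
        uncovered -= cover[best]
    return bases


def euclidean(p, q):
    """Planar distance, used here only to keep the example small."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


sites = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
bases = greedy_bases(sites, 1.5, euclidean)
```

For real lat/lon data, `dist` would be a great-circle distance such as haversine rather than the planar distance used in this toy example.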

Minimizing Sum of Distances: Optimization Problem

The actual question goes like this:
McDonald's is planning to open a number of joints (say n) along a straight highway. These joints require warehouses to store their food. A warehouse can store food for any number of joints, but has to be located at one of the joints only. McD has a limited number of warehouses (say k) available, and wants to place them in such a way that the average distance of joints from their nearest warehouse is minimized.
Given an array (n elements) of coordinates of the joints and an integer 'k', return an array of 'k' elements giving the coordinates of the optimal positioning of warehouses.
Sorry, I don't have any examples available since I'm writing this down from memory. Anyway, one sample could be:
array={1,3,4,5,7,7,8,10,11} (n=9)
Ans: {7}
This is what I've been thinking: For k=1, we can simply find out the median of the set, which would give the optimal location of the warehouse. However, for k>1, the given set should be divided into 'k' subsets (disjoint, and of contiguous elements of the superset), and median for each subset would give the warehouse locations. However, I don't understand on what basis the 'k' subsets should be formed. Thanks in advance.
EDIT: There's a variation to this problem also: Instead of sum/avg, minimize the maximum distance between a joint and its closest warehouse. I don't get this either..
The straight highway makes this an exercise in dynamic programming, working from left to right along the highway. A partial solution can be described by the location of the rightmost warehouse and the number of warehouses placed. The cost of the partial solution will be the total distance to the nearest warehouse (for fixed k, minimising this is the same as minimising the average) or, for the variation, the maximum distance so far to the closest warehouse.
At each stage you have worked out the answers for the leftmost N joints and have them indexed by the number of warehouses used and the position of the rightmost warehouse; you need to save only the best cost for each. Now consider the next joint and work out the best solution for N+1 joints and all possible values of k and rightmost warehouse, using the answers you have stored for N joints to speed this up. Once you have worked out the best-cost solution covering all the joints, you know where its rightmost warehouse is, which gives you the location of one warehouse. Go back to the solution that has that warehouse as the rightmost one and find out which solution it was based on. That gives you one more rightmost warehouse, and so you can work your way back to the locations of all the warehouses for the best solution.
I tend to get the cost of working this out wrong, but with N joints and k warehouses to place you have N steps to take, each of them based on considering no more than Nk previous solutions, so I reckon the cost is kN^2.
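A dynamic program along these lines can be sketched in Python. Note that this is a variant of the description above, not a literal transcription: it splits the sorted joints into k contiguous groups and serves each group from its median joint, as the question proposes, instead of indexing states by the rightmost warehouse. It handles the sum-of-distances objective only and assumes 1 <= k <= n. The function name is made up for this example.

```python
def place_warehouses(joints, k):
    """Return (minimum total distance, warehouse coordinates) for k warehouses."""
    xs = sorted(joints)
    n = len(xs)
    prefix = [0]  # prefix[i] = sum of xs[:i]
    for x in xs:
        prefix.append(prefix[-1] + x)

    def cost(i, j):
        """Total distance from xs[i..j] (inclusive) to their median joint."""
        m = (i + j) // 2
        left = xs[m] * (m - i) - (prefix[m] - prefix[i])
        right = (prefix[j + 1] - prefix[m + 1]) - xs[m] * (j - m)
        return left + right

    inf = float("inf")
    # best[w][j] = minimum cost of serving the first j joints with w warehouses
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    split = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0
    for w in range(1, k + 1):
        for j in range(1, n + 1):
            # The last group is xs[i..j-1]; earlier groups cover xs[0..i-1].
            for i in range(w - 1, j):
                c = best[w - 1][i] + cost(i, j - 1)
                if c < best[w][j]:
                    best[w][j] = c
                    split[w][j] = i

    # Walk the stored split points backwards to recover the locations.
    locations = []
    j = n
    for w in range(k, 0, -1):
        i = split[w][j]
        locations.append(xs[(i + j - 1) // 2])
        j = i
    return best[k][n], locations[::-1]


total, where = place_warehouses([1, 3, 4, 5, 7, 7, 8, 10, 11], 1)
```

The three nested loops give O(k n^2) work because the prefix sums make each `cost` call constant time, which matches the estimate above. On the sample array with k=1 the sketch returns the median joint, 7, agreeing with the answer given in the question.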
This is NOT a clustering problem, it's a special case of a facility location problem. You can solve it using a general integer / linear programming package, but because the problem is on a line, there may be more efficient (and less expensive software-wise) algorithms that would work. You might consider dynamic programming since there are probably combination of facilities that could be eliminated rather quickly. Look into the P-Median problem for more info.