Using ConceptNet5 API to calculate the similarity between texts - word2vec

I have found that the ConceptNet5 API can be used to calculate the similarity between two terms, for example:
http://api.conceptnet.io/related/c/en/dog?filter=/c/en/cat/.&limit=1
However, I was wondering if the API can also be used to calculate the similarity between two texts. For example, can I use the API to calculate the similarity between the following sentences:
"The sea is blue"
"The mountain is white"

Related

How to assign more weight to bigram and trigram?

I have to match the titles of two research papers using n-grams (uni, bi and tri only).
I have been asked by my supervisor that, while matching, I assign more weight to the bigram matched-terms score than to the unigram matched-terms score, and more weight to the trigram matched-terms score than to the bigram matched-terms score.
For example, if two bigrams are matched in a title then the score = 2, and if two trigrams are matched then the score is also 2.
I have to find some weighting values to multiply the scores by, so that the trigram score is increased relative to the bigram score.
I looked for research papers related to this problem but I couldn't get any help from there. :(
Can anyone give me some ideas, or a link to a document which may solve the issue?
In interpolation, we always mix the probability estimates from all the N-gram estimators, weighing and combining the trigram, bigram, and unigram counts.
In simple linear interpolation, we combine different-order N-grams by linearly interpolating all the models. Thus, we estimate the trigram probability P(wn | wn−2 wn−1) by mixing together the unigram, bigram, and trigram probabilities, each weighted by a λ:

P̂(wn | wn−2 wn−1) = λ1 P(wn) + λ2 P(wn | wn−1) + λ3 P(wn | wn−2 wn−1)

such that the λs sum to 1:

λ1 + λ2 + λ3 = 1
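As a concrete illustration of weighting matched n-grams by order, here is a minimal sketch; the weights (1, 2, 4) are illustrative assumptions chosen so that trigram matches outweigh bigram matches, which outweigh unigram matches, and in practice you would tune them (or the λs above) empirically:

    def ngrams(tokens, n):
        """Set of n-grams of a token list."""
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def weighted_match_score(title1, title2, weights=(1.0, 2.0, 4.0)):
        # weights = (w_uni, w_bi, w_tri), with w_tri > w_bi > w_uni.
        t1, t2 = title1.lower().split(), title2.lower().split()
        return sum(w * len(ngrams(t1, n) & ngrams(t2, n))
                   for n, w in zip((1, 2, 3), weights))

    print(weighted_match_score("deep learning for text classification",
                               "deep learning for image classification"))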

similarity measure scikit-learn document classification

I am doing some work on document classification with scikit-learn. For this purpose, I represent my documents in a tf-idf matrix and feed a Random Forest classifier with this information, which works perfectly well. I was just wondering which similarity measure is used by the classifier (cosine, Euclidean, etc.) and how I can change it. I haven't found any parameters or information in the documentation.
Thanks in advance!
As with most supervised learning algorithms, Random Forest classifiers do not use a similarity measure; they work directly on the features supplied to them. So the decision trees are built based on the terms in your tf-idf vectors.
If you want to use similarity, then you will have to compute a similarity matrix for your documents and use this as your features (see the sketch below).
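A minimal sketch of that suggestion, assuming cosine similarity (the data here is hypothetical; the document-to-document similarity matrix replaces the raw tf-idf matrix as features):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["the sea is blue", "the mountain is white", "the sky is blue"]
    labels = [0, 1, 0]

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs)        # documents x terms
    features = cosine_similarity(tfidf)           # documents x documents
    clf = RandomForestClassifier(n_estimators=100).fit(features, labels)

    # At prediction time, new documents need the same feature layout:
    # their similarity to each *training* document.
    # new_features = cosine_similarity(vectorizer.transform(new_docs), tfidf)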

Finding the Distance Between Two Lines that represent GPS routes (MATLAB, Java, C++, or Python)

I have been researching and trying to figure this one out, to no avail. I have found many ways not to solve this...
The gist of the problem: I am looking for a method to calculate the deviation from an original path traveled, by way of GPS coordinates. I have multiple CSV files that contain latitude, longitude, and UTC time. I have created KML files from this information for visual inspection of the deviation, and now I would like to put a value on that deviation. I have chosen one route as a reference and would like to measure the other routes against it. There are multiple routes, each with its own reference route, and each route has many runs. No two runs are the same, and some routes deviate more than others. I cannot use time, only latitude and longitude, since the runs were completed over many weeks of data collection.
What I have tried thus far:
Haversine and Equirectangular formulas (looping through and measuring point to point).
Outcome: The coordinates only line up for a short period of time and the difference in the number of points varies greatly.
Area under each curve: I was going to find the difference between the two routes by this method.
Outcome: really unsure how to proceed, and I could not find equations suitable for this calculation.
There were a couple more feeble attempts, but I have been working on this for a few weeks now with not much to show for it, and I am still unsure how to proceed.
Any help or ideas would be greatly appreciated.
Possible solution 1: Instead of calculating the "sideways" deviation between the two routes, just compare the respective arc lengths (Matlab: arclength).
Possible solution 2: To compare two routes, each going from the same start A to the same end point B: Draw a straight line between A and B, place a number of equidistant points along AB, and then average the perpendicular distance from these points on AB to the paths you want to compare. The absolute difference between the cumulative deviations from the straight-line reference is your deviation.
Possible solution 3: Calculate the arc length of each route. Place the same number of equidistant points along each route, then average the distance between corresponding points (see the sketch below).
Both solutions 2 and 3 depend on the number of points you place, but with a higher number of points the average deviation will converge. Note that these solutions are both related to calculating the area under each curve.
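A hedged sketch of solution 3 in Python (my assumptions: routes are NumPy arrays of (lat, lon) rows, distances via the Haversine formula, and linear interpolation along cumulative arc length for the resampling):

    import numpy as np

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance in kilometres (vectorized)."""
        lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
        a = (np.sin((lat2 - lat1) / 2) ** 2
             + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
        return 6371.0 * 2 * np.arcsin(np.sqrt(a))

    def resample(route, n_points):
        """Place n_points equidistant (by arc length) points along a route."""
        lat, lon = route[:, 0], route[:, 1]
        seg = haversine_km(lat[:-1], lon[:-1], lat[1:], lon[1:])
        s = np.concatenate(([0.0], np.cumsum(seg)))   # cumulative arc length
        t = np.linspace(0.0, s[-1], n_points)
        return np.column_stack((np.interp(t, s, lat), np.interp(t, s, lon)))

    def mean_deviation_km(route, reference, n_points=200):
        a, b = resample(route, n_points), resample(reference, n_points)
        return np.mean(haversine_km(a[:, 0], a[:, 1], b[:, 0], b[:, 1]))

This sidesteps the point-count mismatch you hit with the point-to-point Haversine loop, since both routes are resampled to the same number of points before comparison.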

Spatial queries on AWS SimpleDB

I would like to know what people suggest as efficient ways of doing a spatial query in Amazon Web Services SimpleDB.
By spatial query I mean finding objects in a given radius of a latitude and longitude.
SimpleDB doesn't currently offer any built-in spatial search operations, but that doesn't mean it can't be done. There are several methods of implementing geospatial searches in non-geospatially-aware databases such as SimpleDB, and all of them center around the idea of using the database to retrieve a rough first selection based on a geospatial bounding box, then filtering the returned data in your application using more accurate algorithms such as the Haversine formula.
You could store the latitude and longitude as (zero-padded and normalized) numeric attributes and then perform a double range query (lat >= minLat and lat <= maxLat and lon >= minLon and lon <= maxLon), but since neither of these predicates is selective (each predicate matches a lot of items) it's not ideal (see Tuning Queries).
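A hedged sketch of that zero-padding trick (the offset, width, and scale here are illustrative): SimpleDB compares attribute values lexicographically as strings, so coordinates are shifted to be non-negative and padded to a fixed width before storing and querying:

    def encode_coord(value, offset, width=10, scale=10**6):
        """e.g. offset=90 for latitude, offset=180 for longitude."""
        return str(int(round((value + offset) * scale))).zfill(width)

    lat_attr = encode_coord(-25.382708, 90)    # '0064617292'
    lon_attr = encode_coord(-49.265506, 180)   # '0130734494'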
A better way would be using GeoHashes.
Geohashes offer properties like arbitrary precision, similar prefixes for nearby positions, and the possibility of gradually removing characters from the end of the code to reduce its size (and gradually lose precision).

As a practical example, the Geohash 6gkzwgjzn820 decodes to the coordinates -25.382708 and -49.265506, while the Geohash 6gkzwgjz will decode to -25.383 and -49.266, and if we take a similar position in the same region, such as -25.427 and -49.315, we can see it being encoded as 6gkzmg1w (note the similar prefix).
From http://geohash.org/site/tips.html
With your item positions stored as GeoHashes, you could use the like operator to search for a bounding box (where GeoHash like '6gkzmg1w%'), but since the like operator is expensive (see Comparison Operators) a better way is to denormalize the data by storing each GeoHash prefix level (how many depends on your required search precision) as a separate attribute (GeoHash6, GeoHash8, etc.) and then use a simple equality predicate (where GeoHash8 = '6gkzmg1w').
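For illustration, a self-contained sketch of the standard geohash encoding (interleaved longitude/latitude bisection, five bits per base-32 character) and of building those prefix attributes; in practice a geohash library would do the encoding, and the attribute names follow the scheme above:

    BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

    def geohash_encode(lat, lon, precision=12):
        lat_lo, lat_hi, lon_lo, lon_hi = -90.0, 90.0, -180.0, 180.0
        bits, even = [], True              # even-indexed bits encode longitude
        while len(bits) < precision * 5:
            if even:
                mid = (lon_lo + lon_hi) / 2
                bits.append(lon >= mid)
                lon_lo, lon_hi = (mid, lon_hi) if lon >= mid else (lon_lo, mid)
            else:
                mid = (lat_lo + lat_hi) / 2
                bits.append(lat >= mid)
                lat_lo, lat_hi = (mid, lat_hi) if lat >= mid else (lat_lo, mid)
            even = not even
        return "".join(BASE32[int("".join("1" if b else "0" for b in bits[i:i + 5]), 2)]
                       for i in range(0, precision * 5, 5))

    full = geohash_encode(38.897, -77.036)     # 'dqcjqcp8...'
    item = {"GeoHash": full, "GeoHash6": full[:6], "GeoHash8": full[:8]}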
Now on to the downside of GeoHashes. Since you can't assume that a point is centered within its GeoHash bounding box, you have to search all neighboring prefixes as well. The process is excellently described by geohash-js:
Geohash also has the property that as the number of digits decreases (from the right), accuracy degrades. This property can be used to do bounding box searches, as points near to one another will share similar Geohash prefixes.

However, because a given point may appear at the edge of a given Geohash bounding box, it is necessary to generate a list of Geohash values in order to perform a true proximity search around a point. Because the Geohash algorithm uses a base-32 numbering system, it is possible to derive the Geohash values surrounding any other given Geohash value using a simple lookup table.

So, for example, 1600 Pennsylvania Avenue, Washington DC resolves to: 38.897, -77.036

Using the geohash algorithm, this latitude and longitude is converted to: dqcjqcp84c6e

A simple bounding box around this point could be described by truncating this geohash to: dqcjqc

However, 'dqcjqcp84c6e' is not centered inside 'dqcjqc', and searching within 'dqcjqc' may miss some desired targets.

So instead, we can use the mathematical properties of the Geohash to quickly calculate the neighbors of 'dqcjqc'; we find that they are: 'dqcjqf','dqcjqb','dqcjr1','dqcjq9','dqcjqd','dqcjr4','dqcjr0','dqcjq8'

This gives us a bounding box around 'dqcjqcp84c6e' roughly 2km x 1.5km and allows for a database search on just 9 keys:

SELECT * FROM table WHERE LEFT(geohash,6) IN ('dqcjqc', 'dqcjqf','dqcjqb','dqcjr1','dqcjq9','dqcjqd','dqcjr4','dqcjr0','dqcjq8');
Translated to a SimpleDB query, that would be: where GeoHash6 in ('dqcjqc', 'dqcjqf', 'dqcjqb', 'dqcjr1', 'dqcjq9', 'dqcjqd', 'dqcjr4', 'dqcjr0', 'dqcjq8'). You then do your Haversine filtering on the results in order to keep only the items that are within your search radius.
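Putting the pieces together, a hedged sketch of the query-then-filter flow; sdb_select is a hypothetical stand-in for your SimpleDB client's select call, and the neighbor list is assumed to be precomputed as described above:

    from math import asin, cos, radians, sin, sqrt

    def haversine_km(lat1, lon1, lat2, lon2):
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = (sin((lat2 - lat1) / 2) ** 2
             + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
        return 6371.0 * 2 * asin(sqrt(a))

    def items_within(center_lat, center_lon, radius_km, cell, neighbors):
        # Rough first selection: the center cell plus its 8 neighbors.
        hashes = ", ".join("'%s'" % h for h in [cell] + neighbors)
        query = "select * from places where GeoHash6 in (%s)" % hashes
        candidates = sdb_select(query)         # hypothetical client call
        # Accurate second pass: keep only items inside the search radius.
        return [item for item in candidates
                if haversine_km(center_lat, center_lon,
                                float(item["lat"]), float(item["lon"])) <= radius_km]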
I'm going to leave this here because it might help you!
14 years ago we tried to build a geo lookup table of locations within a radius. There were obviously no geospatial indexes or anything like that; there was literally only standard SQL and Oracle. Anyway, we ended up converting all lat/lng values into kilometres from a fixed reference point, which is essentially what geospatial indexes do these days.
To explain what exactly it does: it flattens the world onto a plane, and with a bit of SQL trickery you can even select by radius and get the distance between the two points you're selecting. Since the values are stored as raw integers, the queries are blazing fast.
Here is a simple example in PHP, with an SQL query that looks complex but is pretty easy once you understand it:
https://gist.github.com/tobsn/899413
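The linked gist is PHP; as illustration only, here is a hedged Python sketch of the same flat-plane idea using an equirectangular approximation from an assumed fixed origin (the constants and origin are my assumptions, not taken from the gist):

    from math import cos, radians, sqrt

    ORIGIN_LAT, ORIGIN_LON = 0.0, 0.0      # fixed reference point (assumption)
    KM_PER_DEG_LAT = 110.574               # approximate
    KM_PER_DEG_LON = 111.320               # at the equator; scaled by cos(lat)

    def to_plane_m(lat, lon):
        """Project lat/lon to integer metres east/north of the origin."""
        x_km = (lon - ORIGIN_LON) * KM_PER_DEG_LON * cos(radians(lat))
        y_km = (lat - ORIGIN_LAT) * KM_PER_DEG_LAT
        return int(x_km * 1000), int(y_km * 1000)  # raw integers, fast to compare

    def plane_distance_km(p, q):
        (x1, y1), (x2, y2) = to_plane_m(*p), to_plane_m(*q)
        return sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2) / 1000.0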

When calculating a distance from a city, how can I factor in the approximate size (physical area) of the city?

I'm building a store locator based on in-house geocoding data. Effectively I need to query stores near City X or ZIP code Y within a certain radius. The data sets I'm working with are relatively comprehensive and include things such as population.
One issue is that large cities (Los Angeles for example) are many miles in radius so you could be within the city but miles from the coordinate we have loaded.
Is there a rule of thumb, or a free data feed which would list an approximate radius of a city, or perhaps even outlines of the city points?
Also, assuming I have a shape defining the city what calculation would I use to say "stores within X miles of this area"?
Why don't you use the zip codes and latitude/longitude of the stores, instead of the cities? You know the addresses of the stores, so use each store's zip code, look up its coordinates, and calculate the distance from the origin zip code. Then it wouldn't matter how big the city is, because big cities have many zip codes, but each store has its own zip code.
It would only be a problem in states with large zip-code areas, like Texas, but there is likely no more than one store per zip code there anyway, so it's not a big deal.
Ultimately we didn't implement this feature, but before it was cancelled I had a fair amount of success using the below approach:
Finding coordinates for the city itself, as well as all zip codes of the city
"Connecting the dots" of all the above coordinates to create a polygon of the (very rough shape of the city)
Checking if the user's input coordinate was within the given range of the polygon
The above approach worked relatively well and might ultimately have developed into a sound solution with some more enhancements and tuning.
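As illustration, a hedged sketch of that polygon check using shapely (my choice of library; all coordinates and values are hypothetical). The points are in degrees, so the miles-to-degrees factor is rough, and a production version would project to a planar coordinate system before buffering:

    from shapely.geometry import MultiPoint, Point

    # "Connect the dots": the city's own coordinate plus its zip-code
    # centroids, hulled into a rough city shape (hypothetical values).
    points = [(-118.5, 34.2), (-118.1, 34.3), (-118.0, 33.9), (-118.4, 33.8)]
    city = MultiPoint(points).convex_hull

    DEG_PER_MILE = 1 / 69.0                  # rough mid-latitude conversion

    def within_x_miles(lat, lon, miles):
        """True if (lat, lon) lies inside the city shape grown by `miles`."""
        return city.buffer(miles * DEG_PER_MILE).contains(Point(lon, lat))

    print(within_x_miles(34.05, -118.25, 10))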