How to compare two web objects?

How can I compare two web objects? The objects may be web sites, profiles, web documents, etc.
Pseudocode for the function:
function compareWebObjects(url1, url2) {
    float distance = 1.0;
    ...
    return distance; // from 0.0 (objects are similar) to 1.0 (maximally different)
}

Elasticsearch scoring on multiple indexes: dfs_query_then_fetch returns the same scores as query_then_fetch

I have multiple indices in Elasticsearch (and the corresponding documents in Django created using django-elasticsearch-dsl). All of the indices have these settings:
settings = {'number_of_shards': 1,
'number_of_replicas': 0}
Now, I am trying to perform a search across all 10 indices. To get consistent scoring between the results from different indices, I am using dfs_query_then_fetch:
from elasticsearch_dsl import Search

search = Search(index=['mov*'])
search = search.params(search_type='dfs_query_then_fetch')
objects = search.query("multi_match", query='Tom & Jerry', fields=['title', 'actors'])
I get bad results due to inconsistent scoring. A book called 'A story of Jerry and his friend Tom' from one index can be ranked higher than the cartoon 'Tom & Jerry' from another index. It seems that dfs_query_then_fetch has no effect: when I remove it or replace it with the plain query_then_fetch, I get exactly the same results with identical scoring.
I have tested it on URI requests as well, and I always get the same scores for both search types.
What can be the reason for it?
UPDATE: The results are actually not the same, but they differ only slightly, e.g. a score of 50.1 with dfs and 50.0 without dfs, while the same model within one index has a score of 80.0.
If the number of shards is 1, then dfs_query_then_fetch and query_then_fetch will return the same result. A DFS query first collects term statistics from all shards and then scores the results against those combined statistics, but in this case there is only one shard.
Regarding the scoring, you might want to have a look at your actors field too. Also, please let us know which analyzer and tokenizer you use, if you have custom ones.

Own simple load balancer for dynamic chances / probabilities (in C++ but language independent)

Hello lovely people,
My program is written as a scalable network framework. It consists of several components that run as individual programs. Additional instances of the individual components can be added dynamically.
The components initially register with their IP and port at a central unit. This manager periodically tells the components where the other components can be found. In addition, each component is assigned a weight / probability / chance that determines how often it should be addressed compared to the others.
As an example: 1 Master, Components A, B, C
All components are registered at the Master, and the Master sends to A: [B(127.0.0.1:8080, 3); C(127.0.0.1:8081, 5)]
A runs in a loop and calculates the communication partner over and over again from this data.
So A should send requests to B and C in a 3 to 5 ratio. How many requests each one ultimately gets depends on the runtime performance; what matters here is the ratio.
Of course, the numbers 3 and 5 are sent periodically and change dynamically. And it's not about 3 components but potentially hundreds.
My idea was:
Add 3 and 5. Calculate a random number between 1 and 8. If it is greater than 3, take C, else B ...
But I don't think that's a clean solution. It is probably computationally intensive in every loop iteration, and the management and data structures are expensive. I also think that a random number from the STL is not balanced enough.
Can someone give me a hint on how to implement this cleanly, or does someone have experience with this or an idea?
Thanks in any case ;)
I have an idea for you:
Why not try it with cumulative probabilities?
1.) Generate a uniformly distributed random number.
2.) Iterate through your list until the cumulative probability of the visited element is greater than the random number.
Look at this (Java code, but the same approach also works in C++; your hint that you use C++ was very good!):
double p = Math.random(); // uniform in [0, 1)
double cumulativeProbability = 0.0;
for (Item item : items) {
    // probabilities are assumed to sum to 1.0 (e.g. weight / total weight)
    cumulativeProbability += item.probability();
    if (p <= cumulativeProbability) {
        return item;
    }
}
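The same idea can be sketched in Python for raw weights such as 3 and 5, without normalizing them first. This is only a minimal sketch: the names `build_cumulative` and `pick` and the partner labels are made up for illustration. The running totals are rebuilt only when the manager sends new weights, and each pick is then a binary search instead of a linear scan.

```python
import bisect
import itertools
import random

def build_cumulative(weights):
    """Return running totals of the weights, e.g. [3, 5] -> [3, 8]."""
    return list(itertools.accumulate(weights))

def pick(partners, cumulative, rng=random):
    """Pick one partner with probability proportional to its weight."""
    # Uniform number in [0, total); no normalization to 1.0 is needed.
    r = rng.random() * cumulative[-1]
    # Index of the first running total that is greater than r.
    idx = bisect.bisect_right(cumulative, r)
    # Guard against floating-point rounding at the upper edge.
    return partners[min(idx, len(partners) - 1)]

# Hypothetical data: B has weight 3, C has weight 5.
partners = ["B", "C"]
cumulative = build_cumulative([3, 5])
chosen = pick(partners, cumulative)
```

With hundreds of components, rebuilding the list costs O(n) per weight update and each pick costs O(log n). Python's standard library also offers `random.choices(partners, weights=[3, 5])` for the same purpose.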

Excel IF function and in between values, but only if

I have values for item price, weight and postage service. There are two postage services (express and eco). The postage price depends on the weight, but the service depends on the item price (express for items over £5, eco for items under £5).
Service, by product price (A2):
<5 = eco; >5 = express
Service price (C2) by weight (B2):
<=1000 g = £2 eco or £3 express
1001-1250 g = £5 eco or £6 express
1251-5000 g = £9 eco or £11 express
Cells A2 and B2 always contain a value. I need a formula for C2 that displays the service price calculated by weight: the express price if the item costs over £5, the eco price if it costs less.
I have tried:
>IF(AND(OR(B2<=1000),A2<5),2,IF(AND(OR(B2>1000,B2<=1250),A2<5),5,IF(AND(OR(B2>1250,B2<=5000),A2<5),9)))
>IF(AND(OR(B2<=1000),A2<5),2)+IF(AND(OR(B2>=1001,B2<=1250),A2<5),5)+IF(AND(OR(B2>2000),A2<5),9)
I didn't start adding A2>5, because nothing works anyway! I tried many more, but no luck.
I would appreciate any help because I'm stuck and have run out of options :(
Thanks!
There are a couple of ways to accomplish this. The preferred method is to build a small cross-reference table for your surcharges and use the VLOOKUP function to return the values.
However, this question was about hard-coded values in a conditional statement, so I will address that with a LOOKUP function and arrayed constants.
The standard formula in C2 is,
=LOOKUP(B2,{0,1001,1251},{2,5,9})+(A2>5)*LOOKUP(B2,{0,1001,1251},{1,1,2})
Fill down as necessary.
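For readers who want to check the banding logic outside Excel, here is a minimal Python sketch of the same lookup. The function name `postage` and the constants are made up for illustration. It assumes that a price of exactly £5 counts as eco, which the question leaves open, and it rejects weights above 5000 g instead of formatting them.

```python
import bisect

# Lower bounds (in grams) of the second and third weight bands.
WEIGHT_BREAKS = [1001, 1251]
ECO_PRICES = [2, 5, 9]        # GBP per band, eco service
EXPRESS_PRICES = [3, 6, 11]   # GBP per band, express service
MAX_WEIGHT = 5000

def postage(price, weight):
    """Return the postage in GBP for an item price (GBP) and weight (grams)."""
    if weight > MAX_WEIGHT:
        raise ValueError("too heavy")
    # Number of band boundaries at or below the weight gives the band index.
    band = bisect.bisect_right(WEIGHT_BREAKS, weight)
    # Assumption: exactly 5 GBP is treated as eco.
    table = EXPRESS_PRICES if price > 5 else ECO_PRICES
    return table[band]

print(postage(4.50, 1000))
print(postage(7.00, 1300))
```

The two price lists play the role of the arrayed constants in the formula; keeping them next to the band boundaries makes it easy to add a band later.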
In the following image, custom number formats were used on columns A and B ([Color9]\Exp\r\e\s\s - [$£-809]#,##0.00;;[Color10]\Eco - [$£-809]#,##0.00; and 0\g\r_)). Weights >5000 in column B trigger a conditional formatting in column C that displays too heavy.
    

Anybody know a good way to do offline reverse geocoding?

I need to do offline reverse geocoding, meant to run as cron jobs. The reverse geocoder should give the city and state for millions of 'records'.
What services exist (paid or free) that can do this?
I've looked at http://www.geonames.org/, but that's not exactly what I need.
Get a table of the coordinates of all cities (and which state each city is in). For any coordinate, pick the closest city. For efficient lookups, store the table in an R-tree.
For links on R-tree etc, see Reverse Geocoding Without Web Access
Get cities table from geonames, see Given the lat/long coordinates, how can we find out the city/country?
This question is quite old, but I have found another solution besides the popular R-Tree-based solution.
Treat the map as an image: you can run-length-encode the map into a (sorted) list of (x, y1, y2, loc) tuples at the desired resolution. To find the location of a point (a, b), simply find the row where x = a and y1 <= b <= y2.
This solution is efficient because the map has a fixed resolution. Compared to the R-tree-based solution, it has the advantage of not introducing new libraries into existing systems. The data can be queried efficiently by a regular SQL database as long as (x, y1) is indexed, or by running binary searches on a static file.
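A minimal sketch of that binary-search lookup in Python, assuming the points have already been converted to integer grid cells. The `RUNS` data and place names are invented for illustration and `locate` is a hypothetical helper name.

```python
import bisect

# Hypothetical run-length-encoded map: (x, y1, y2, loc), sorted by (x, y1).
RUNS = [
    (0, 0, 4, "Springfield"),
    (0, 5, 9, "Shelbyville"),
    (1, 0, 9, "Springfield"),
]
# Search keys in the same order as RUNS.
KEYS = [(x, y1) for x, y1, _, _ in RUNS]

def locate(a, b):
    """Return the location covering grid cell (a, b), or None if unmapped."""
    # Rightmost run whose (x, y1) is <= (a, b).
    i = bisect.bisect_right(KEYS, (a, b)) - 1
    if i < 0:
        return None
    x, y1, y2, loc = RUNS[i]
    # The candidate run must be in the same column and actually span b.
    return loc if x == a and y1 <= b <= y2 else None

print(locate(0, 7))
```

The same query maps directly onto an SQL table with an index on (x, y1): look up the last row at or before (a, b), then check that y2 still covers b.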
There is a website (https://reverse-geocoding.com/) that offers such database and free demos on how this works. Disclosure: I'm working on this site.

Finding min distance between several points, or merging querysets with extra selects

I have a bunch of objects with location attributes (PointFields). I have two special locations, and I want to know which of those locations each object is closest to and how far that is. That is, I'd like to do something like:
q0 = q.distance(p0).extra(select={'dist_from': p0})
q1 = q.distance(p1).extra(select={'dist_from': p1})
qq = take_obj_with_min_distance(q0, q1)
(The actual query will do some stuff with bboverlaps and location__distance_lt, possibly involve more than two special locations, and possibly objects with multiple location attributes. Nevertheless, I think a solution to the above will handle all that other stuff.)
Afterwards, qq should have the same elements as q, but each element has a distance attribute and a dist_from attribute, where the distance attribute is the minimum of the distance from p0 and the distance from p1, and dist_from is the point with which it achieves that minimum.
Can I do this? Is it healthy for children and other living things?
I considered merging the queries and doing this stuff with a list, but of course you can't merge queries with extra select values (such as are introduced by distance queries). Also, I'll want to filter qq some more afterwards.
This page will give you the required code in a bunch of languages: http://www.codecodex.com/wiki/Calculate_Distance_Between_Two_Points_on_a_Globe
If you don't have too many objects, you might wish to do it in Python, but if you prefer querying the database, it might be best to write a stored procedure or function in SQL.
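A minimal sketch of the pure-Python route, using the haversine great-circle formula. Points are plain (lat, lon) tuples in degrees rather than GeoDjango PointField values, and `haversine_km` and `closest_special` are hypothetical helper names. It returns both the minimum distance and the special point that achieves it, which corresponds to the distance and dist_from attributes in the question.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def closest_special(location, special_points):
    """Return (distance_km, dist_from) for the nearest special point."""
    return min((haversine_km(location, p), p) for p in special_points)

# Hypothetical special locations p0 and p1 and one object location.
p0, p1 = (0.0, 0.0), (0.0, 10.0)
distance, dist_from = closest_special((0.0, 1.0), [p0, p1])
```

This works for any number of special locations, but the result is a list of Python values, so further filtering has to happen in Python as well rather than on a queryset.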