OpenStreetMap response - importance and place_rank fields - geocoding

What is the meaning of the 'importance' and 'place_rank' fields in an OpenStreetMap (Nominatim) response? I can't find it anywhere in the documentation :/
For example the response of this url:
http://nominatim.openstreetmap.org/search?q=135+pilkington+avenue,+birmingham&format=xml&polygon=1&addressdetails=1
is:
<place place_id="62311100" osm_type="way" osm_id="90394480" place_rank="30" ...OMISSIS... importance="0.701">
In the above response I have removed all the XML parts I'm not interested in.

As far as I know:
The importance is used for ordering search results according to their relevance. The importance value is calculated/estimated using various attributes including the place's popularity on Wikipedia and its rank.
The rank is based on a rather complex algorithm that takes the place type and various other attributes into account. For example, it seems to check whether the object is a village, a city, a country, a continent, a highway, a lake, and similar properties.
Unfortunately these attributes lack proper documentation. So all you can do is try to look at Nominatim's source code if you need more detailed information. From there I tried to extract the information mentioned above.
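To make the two fields concrete, here is a minimal Python sketch that reads place_rank and importance out of a response shaped like the one in the question and orders the places by importance. The sample XML is trimmed by hand, the searchresults wrapper element is an assumption about the surrounding document, and the function name is made up for illustration; nothing here reproduces how Nominatim computes the values.

```python
import xml.etree.ElementTree as ET

# Hand-trimmed sample shaped like the response in the question.
# The <searchresults> wrapper is an assumption; the elided attributes are left out.
SAMPLE = """<searchresults>
  <place place_id="62311100" osm_type="way" osm_id="90394480"
         place_rank="30" importance="0.701"/>
</searchresults>"""


def read_ranking_fields(xml_text):
    """Return (place_id, place_rank, importance) for every <place> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for place in root.iter("place"):
        rows.append((
            place.get("place_id"),
            int(place.get("place_rank")),
            float(place.get("importance")),
        ))
    # Most important place first, matching the ordering role described above.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


print(read_ranking_fields(SAMPLE))
```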

I have large file contents that I want to make searchable on AWS CloudSearch but the maximum document size is 1MB - how do I deal with this?

I could split the file contents up into separate search documents but then I would have to manually identify this in the results and only show one result to the user - otherwise it will look like there are 2 files that match their search when in fact there is only one.
Also the relevancy score would be incorrect. Any ideas?
So the response from AWS support was to split the files up into separate documents. In response to my concerns regarding relevancy scoring and multiple hits they said the following:
You do raise two very valid concerns here for your more challenging use case. With regard to relevance, you face a very significant problem already in that it is harder to establish a strong 'signal' and degrees of differentiation with large bodies of text. If the documents you have are much like reports or whitepapers, a potential workaround to this may be in indexing the first X number of characters (or the first identified paragraph) into a "thesis" field. This field could be weighted to better indicate what the document subject matter may be without manual review.
With regard to result duplication, this will require post-processing on your end if you wish to filter it. You can create a new field that holds a unique "Parent" id shared by each chunk of the whole document. The post-processing can check whether this "Parent" id has already been returned (the first result should be seen as most relevant), and if it has, filter the subsequent results. What is doubly useful in such a scenario is that you can include a refinement link in your results that filters on all matches within that particular Parent id.
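The approach support describes could be sketched roughly as follows in Python. The field names (id, parent_id, thesis, content), the chunk size, and the 200-character thesis length are all assumptions for illustration, and the size is counted in characters rather than bytes, so a real 1MB limit would need a more careful measure. No CloudSearch calls are made here.

```python
def split_into_chunks(parent_id, text, max_chars=1000, thesis_chars=200):
    """Split one large file into several search documents sharing a parent id."""
    docs = []
    for n, start in enumerate(range(0, len(text), max_chars)):
        docs.append({
            "id": f"{parent_id}_{n}",          # unique per chunk
            "parent_id": parent_id,            # shared by all chunks of the file
            "thesis": text[:thesis_chars],     # opening text, could be weighted higher
            "content": text[start:start + max_chars],
        })
    return docs


def dedupe_by_parent(hits):
    """Keep only the first (most relevant) hit for each parent id."""
    seen = set()
    unique = []
    for hit in hits:  # hits are assumed to arrive sorted by relevance
        if hit["parent_id"] in seen:
            continue
        seen.add(hit["parent_id"])
        unique.append(hit)
    return unique
```

With hits for chunks a_1, b_0 and a_0 (in that order), dedupe_by_parent keeps a_1 and b_0, so the user sees one result per file.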

SOLR query exclusions

I'm having an issue with querying an index where a common search term also happens to be part of a company name interspersed throughout most of the documents. How do I exclude the business name from results without affecting the ranking of a search that includes part of the business name?
example: Bobs Automotive Supply is the business name.
How can I include relevant results when someone searches automotive or supply without returning every document in the index?
I tried "-'Bobs Automotive Supply' +'search term'" but this seems to exclude any document with Bobs Automotive Supply and isn't very effective on searching 'supply' or 'automotive'
Thanks in advance.
Second answer here, based on additional clarification from first answer.
A few options.
Add the business name as StopWords in the StopWordFilter. This will stop Solr from Indexing them at all. Searches that use them will only really search for those words that aren't in the business name.
Rely on the inherent scoring that Solr will apply due to Term frequency. It sounds like these terms will be in the index frequently. Queries for them will still return the documents, but if the user queries for other, less common terms, those will get a higher score.
Apply a low query boost (not quite negative, but less than other documents) to documents that contain the business name. This is covered in the Solr Relevancy FAQ http://wiki.apache.org/solr/SolrRelevancyFAQ#How_do_I_give_a_negative_.28or_very_low.29_boost_to_documents_that_match_a_query.3F
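As a rough illustration of what the stop word option does to both indexed text and queries, here is a small Python sketch. It is not Solr's analyzer, just a whitespace tokenizer with a stop list, and the names are invented for the example.

```python
# Words from the business name, treated as stop words (assumption for the example).
BUSINESS_NAME_STOPWORDS = {"bobs", "automotive", "supply"}


def analyze(text, stopwords=BUSINESS_NAME_STOPWORDS):
    """Lowercase, split on whitespace, and drop stop words."""
    return [token for token in text.lower().split() if token not in stopwords]


# The business name disappears from the indexed terms...
print(analyze("Bobs Automotive Supply brake pads"))
# ...but a query made only of those words is left with nothing to match on.
print(analyze("automotive supply"))
```

The second call shows the trade-off: once the words are stop words, a search for only 'automotive' or 'supply' has no terms left, which is why the scoring and boost options may be a better fit if those searches still need to return results.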
Do you know that the article is tied to the business name, or can you derive this? If so, you could create another field and then just exclude entities that match on the business name using a filter query. Something like
q=search_term&fq=business_name:(NOT search_term)
It may be helpful to use subqueries for this or to just boost down rather than filter out results.
EDIT: An update to the question makes this irrelevant. Leaving it here for posterity. :)
This is why Solr Documents have different fields.
In this case, it sounds like there is a "Footer" field that is separate from your "Body" field in your documents. When searches are performed, they would only be done against the Body, which won't include data from the Footer. You could even have a third field, "OriginalContent", which contains the original copy for display purposes. You wouldn't search that, just store it for later.
The important part is to create the two separate fields in your schema and make sure that you index only those fields that you want to be able to search.

Difference between a post's likes count and the likes data?

I'm seeing a discrepancy between the number of likes reported in the Graph API vs the number of entries in the "data" that has the name and ID of the people who liked a post.
When I view a certain post on Facebook, I see that it has 5 people who have liked it.
When I use the Graph API to fetch the post, the "likes" field has a "data" field with 3 entries in it, and a "count" field whose value is 5.
When I use the Graph API to fetch the likes for the post (eg, {post_id}/likes), I get a "data" field with 5 entries in it (and no "count" field).
Clearly the true answer to how many people have liked the post is 5. But then why are there only 3 entries in the "data" when I fetch the post object?
Here's another example of the same discrepancy:
https://graph.facebook.com/40796308305_10150394134258306 returns data for a post whose "likes/data" only has 1 entry in it, but whose "likes/count" says that there are 3. But https://graph.facebook.com/40796308305_10150394134258306/likes returns "data" with 3 entries. Finding that same entry on Coca-Cola's page finds that there are, in fact, 3 people who have liked it.
The documentation of the post object doesn't mention that the likes list may be incomplete, and the documentation of the fql stream table explicitly says to use the post object to get the full list, so it's either a bug in the API or in the documentation.
I suspect it may be a deliberate but undesirable "feature" to limit the detailed list for performance reasons, as some posts may have hundreds or even thousands of likes.
It ends up actually causing a huge performance problem as I need to find all posts that have been liked by a particular user, and the only way to do that is to do a separate fetch of likes for each post in the list whose like count is higher than the like list length.
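The workaround described above (fetch likes separately only for posts whose count exceeds the embedded list) could be sketched like this in Python. The dictionaries mimic the likes/data and likes/count shape from the question; the post ids and names are made up, and no Graph API request is made.

```python
def posts_needing_like_fetch(posts):
    """Return ids of posts whose embedded likes list is shorter than the count."""
    needing = []
    for post in posts:
        likes = post.get("likes", {})
        embedded = likes.get("data", [])
        # If no count is reported, assume the embedded list is complete.
        count = likes.get("count", len(embedded))
        if count > len(embedded):
            needing.append(post["id"])
    return needing


# Made-up sample data in the shape described in the question.
sample_posts = [
    {"id": "post_a", "likes": {"data": [{"id": "1", "name": "A"}], "count": 3}},
    {"id": "post_b", "likes": {"data": [{"id": "2", "name": "B"}], "count": 1}},
    {"id": "post_c"},  # no likes at all
]
print(posts_needing_like_fetch(sample_posts))
```

Only post_a would need a second request to {post_id}/likes; the other two can be checked from the post object alone.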
2 people have their privacy settings set to not show their name to people who are not their friends.

REST API question on how to handle collections as effective as possible while still conforming to the REST principles

I'm pretty new to REST, but as far as I have gathered, the following URLs conform to the REST principles. The resources are laid out as follows:
/user/<username>/library/book/<id>/tags
          ^         ^           ^   ^
          |---------|-----------|---|- user resource with username as a variable
                    |-----------|---|- many to one collection (books)
                                |---|- book id
                                    |- many to one collection (tags)
GET /user/dave/library/book //retrieves a list of book ids
GET /user/dave/library/book/1 //retrieves info on book id=1
GET /user/dave/library/book/1/tags //retrieves tags collection (book id=1)
However, how would one go about optimizing this example API? Say, for example, I have 10K books in my library and I want to fetch the details of every book in it. Should I really force an HTTP call to /library/book/<id> for every id given in /library/book? Or should I allow multiple ids as parameters, /library/book/<id1>,<id2>..., and do bulk fetching with 100 ids at a time?
What do the REST principles say about this kind of situation? And what are your opinions?
Thanks again.
This is strictly a design matter.
I could define a book resource and use it like this:
GET /user/dave/library/book?bookList=...
How you further specify the bookList argument is really a matter of what kind of usage you envisage for this resource. You could have, e.g.:
GET /user/dave/library/book?bookList=1-10
GET /user/dave/library/book?bookList=1,2,5,20-25
or you could simply page through all of the books:
GET /user/dave/library/book?page=7&pagesize=50
But in my mind, especially the form with a long list of "random" ids seems pretty unfit. Maybe I would instead define a filter parameter so I can specify:
GET /user/dave/library/book?filter=key,value&filter=key,value
As to your question about the HTTP URL length limit, the standard does not set one, but browsers may vary... look at this S.O. topic
To be more strictly RESTful, the query parameter could be specified through HTTP headers, but the general idea I wanted to convey does not change.
Hope this seems suitable to you...
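For the bookList forms shown above (1-10 and 1,2,5,20-25), a server-side parser might look roughly like this Python sketch. The function name is hypothetical and the syntax is only what the examples imply: comma-separated ids, with low-high meaning an inclusive range.

```python
def parse_book_list(spec):
    """Expand a bookList value such as '1,2,5,20-25' into a list of ids."""
    ids = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            low, high = part.split("-", 1)
            ids.extend(range(int(low), int(high) + 1))  # inclusive range
        else:
            ids.append(int(part))
    return ids


print(parse_book_list("1,2,5,20-25"))
```

A malformed value (for example a non-numeric id) raises ValueError here, which a real handler would presumably turn into a 400 response.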
Above looks good, but I would change to plural names, it reads better:
/users/{username}/books/{bookId}
What I don't understand is the use case for passing a comma-separated list of ids. The question is how you get to the ids. I guess there are semantics behind the list of ids, i.e. they represent the result of a filter. So instead of passing ids I would go for a search API. Simplistic example:
/users/dave/books?purchasedAfter=2011-01-01
If you want to iterate through your 10K collection of books, use paging parameters.
This is just my opinion:
GET /user/dave/library/book/IDList //retrieves a list of book ids
or
GET /user/dave/library/bookID //retrieves a list of book ids
GET /user/dave/library/book //retrieves a list of books
GET /user/dave/library/book/1 //retrieves info on book id=1
GET /user/dave/library/book/1-3 //retrieves info on book id>=1 and id <=3
GET /user/dave/library/book/1/tags //retrieves tags collection (book id=1)
You can use a paginator
Some RESTful APIs work with a paginator for huge resources, like:
http://example.org/api/books?page=2
The server delivers, for example, 100 records (in this case books) per page, and you can sort the books using a sortby parameter in your GET request. With the above request you would get books 101-200 (if there are that many in the database). The response can also tell you the number of books and pages, and what the next and previous pages are, but then you are moving towards HATEOAS.
Otherwise, if you want to get certain ids, I would do it like this:
http://example.org/books?id[]=2&id[]=5&id[]=7&id[]=21
A GET request with an array of ids (id = [2,5,7,21]), which returns the books with those respective ids.
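Both ideas in this answer can be sketched in a few lines of Python. The response keys (items, total, pages, prev, next) are invented for illustration, and the id array is assumed to be sent with the common id[]=2&id[]=5 spelling of a repeated query parameter; a given framework may expect a different convention.

```python
import math
from urllib.parse import urlsplit, parse_qs


def paginate(items, page, page_size=100):
    """Return one page of items plus the navigation hints mentioned above."""
    total_pages = math.ceil(len(items) / page_size)
    start = (page - 1) * page_size  # page numbers start at 1
    return {
        "items": items[start:start + page_size],
        "total": len(items),
        "pages": total_pages,
        "prev": page - 1 if page > 1 else None,
        "next": page + 1 if page < total_pages else None,
    }


def ids_from_url(url):
    """Collect every value of the repeated 'id[]' query parameter as ints."""
    query = parse_qs(urlsplit(url).query)
    return [int(value) for value in query.get("id[]", [])]


books = list(range(1, 251))          # 250 fake book ids
second_page = paginate(books, page=2)
print(second_page["items"][0], second_page["items"][-1], second_page["next"])
print(ids_from_url("http://example.org/books?id[]=2&id[]=5&id[]=7&id[]=21"))
```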

Sorting results in Advanced System Reporter in Sitecore

In Sitecore's Advanced System Reporter (v1.3) shared source module, is there an out-of-the-box way of sorting the results before the results are displayed to email/screen or will I need to implement something myself?
In a standard ASR install, I can see the Media Viewer viewer configuration item has a sort parameter in the attributes field but it's using ASR.Reports.Items.ItemViewer class which, after checking in reflector, doesn't respect the sort parameter. I take this to mean that the class might have respected the sort parameter previously but doesn't now.
As a side thought, I would have thought that a Scanner class would be a much more logical place to put sorting logic than at the Viewer class level.
Ok, found the answer. The sort parameter I found is actually used when running the report by the ASR module.
The sort parameter is set up in the attributes and is in the following format:
sort=ColumnName,ASC|DESC,[DateTime]
where ColumnName is the display name of the column, ASC or DESC is the sort direction (required), and DateTime is added only if the column holds a date/time value.
Example:
Given the column formatting of
<Columns>
<Column name="item name">Item Name</Column>
<Column name="publish date">Publish Date</Column>
</Columns>
to sort by publish date descending, the appropriate sort parameter would be
sort=Publish Date,DESC,DateTime
and to sort by item name, the sort parameter would be
sort=Item Name,ASC
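To show how a value in this format can be interpreted, here is a Python sketch that parses the parameter and sorts some rows with it. This is not the ASR module's code (ASR is a .NET module); the function names, the row dictionaries, and the date format are assumptions for illustration only.

```python
from datetime import datetime


def parse_sort_parameter(value):
    """Split 'sort=ColumnName,ASC|DESC,[DateTime]' into its three parts."""
    if value.startswith("sort="):
        value = value[len("sort="):]
    parts = [part.strip() for part in value.split(",")]
    column, direction = parts[0], parts[1].upper()
    if direction not in ("ASC", "DESC"):
        raise ValueError("sort direction must be ASC or DESC")
    is_datetime = len(parts) > 2 and parts[2] == "DateTime"
    return column, direction == "DESC", is_datetime


def sort_rows(rows, sort_value, date_format="%Y-%m-%d"):
    """Sort row dicts keyed by column display name (date format is assumed)."""
    column, descending, is_datetime = parse_sort_parameter(sort_value)

    def key(row):
        cell = row[column]
        return datetime.strptime(cell, date_format) if is_datetime else cell

    return sorted(rows, key=key, reverse=descending)


rows = [
    {"Item Name": "Home", "Publish Date": "2011-03-01"},
    {"Item Name": "About", "Publish Date": "2011-05-20"},
]
print(sort_rows(rows, "sort=Publish Date,DESC,DateTime"))
print(sort_rows(rows, "sort=Item Name,ASC"))
```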
I'm not sure anyone can answer your question immediately, apart from probably the module author. But you have a huge advantage in this case - the module sources. Instead of browsing the assemblies with the Reflector, you can check out the latest sources and just debug it. One debug session can answer more questions than a bunch of SO posts. ;-)
Also, as a side note, you might have noticed special Sitecore logos on that page - this blog post will tell you what it means.