I have recently added search capabilities to my Django-powered site to allow employers to search for employees using keywords. When the user initially uploads their resume, I turn it into text, get rid of stop words, and then add the text to a TextField for that user. I used Django-Haystack with the Whoosh search backend.
Three things:
1) Aside from extra features which I'll probably not use, is there any concrete advantage to switching to Solr or Xapian?
2) In turning the resume into text, I essentially index the pdf myself. I know both Xapian and Solr support .pdf indexing, however, from the looks of it Haystack does not. Any tips on how to get around this? Or should I keep indexing it myself? If so, should I be doing more than simply providing a text file of keywords?
3) Whoosh only returns a result if the keyword matches exactly. If a user has 'mathematics' as his keyword, and I search 'math', I want that user to appear. I couldn't definitively tell whether Xapian or Solr support this. Thoughts?
Thanks for any suggestion. I'm going to continue digging into this myself for the time being.
Unfortunately I don't know enough to answer your other questions, however for point 3.) Whoosh actually does support this.
You would have to use the autocomplete function of SearchQuerySet.
Detailed here:
http://docs.haystacksearch.org/dev/autocomplete.html
I'm currently using Whoosh and matching on partial matches myself.
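To show why this helps with the 'math' versus 'mathematics' case: Haystack's autocomplete works on fields indexed as edge n-grams (EdgeNgramField in the linked docs), so every leading fragment of a word becomes searchable. Below is a minimal plain-Python sketch of that idea only. It is not Haystack or Whoosh code, and the names edge_ngrams, build_index and search, as well as the sample resumes, are made up for illustration.

```python
from collections import defaultdict


def edge_ngrams(word, min_size=2, max_size=15):
    """Return the prefixes of `word` between min_size and max_size characters."""
    word = word.lower()
    return [word[:n] for n in range(min_size, min(len(word), max_size) + 1)]


def build_index(docs):
    """Map every edge n-gram of every keyword to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            for gram in edge_ngrams(word):
                index[gram].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents that have a keyword starting with `query`."""
    return sorted(index.get(query.lower(), set()))


# Hypothetical resumes, already reduced to keywords.
resumes = {
    "alice": "mathematics statistics python",
    "bob": "marketing sales",
}
index = build_index(resumes)
matches = search(index, "math")  # finds the resume containing 'mathematics'
```

In Haystack itself you would declare the n-gram field on your SearchIndex and query it with SearchQuerySet().autocomplete(...), as described on the page linked above.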
Related
Scenario:
In Google Analytics, I noticed that it is possible to replace a URI parameter with words of your choice by using a search and replace filter, as in the example below.
e.g. www.example.com/abc/product_id=3 -----> www.example.com/abc/product_name=shampoo
Problems:
I currently have a list of over 1000 products. Instead of creating 1000 search and replace filters, what would be the most efficient and maintainable way to solve the problem?
I've done some digging and noticed that a custom dimension could be the solution, however it would require me to modify the JS code on the FTP server, which I don't have permission to do. What other solutions do I have?
If it is not possible to show it here would there be any kind of tutorial that I could follow through?
I'd really appreciate the help. Many thanks.
This is not a complete answer, but it's certainly more than a comment.
Besides the tedium of writing this out by hand, I can think of two options available to you.
Firstly, you could use the Google Analytics Management API (https://developers.google.com/analytics/devguides/config/mgmt/v3/). By constructing a set of commands, you could quickly iterate through your list and create the required 1,000 search and replace filters.
Secondly, if you were to use Google Tag Manager you would be able to create a Custom JavaScript Variable that takes the page path and compares it to your list. This variable could then replace the Page field before the hit data is sent to Google Analytics. This may sound more complicated, but it would allow you to pull your solution out of Google Analytics and into the flexible world of JavaScript.
Note that if you rewrite the product_id to a product_name once, you will have to maintain that cross reference every day and keep it in sync with what appears on the website -- make sure you have an automated solution or it will quickly get out-of-sync and be more of a mess than before.
An alternative is to do the search-and-replace on the reporting side.
I know Analytics Edge or Analytics Canvas products could easily do this, or you could just download into Excel or Google Sheets and do a series of lookup formulae.
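If you go the reporting-side route with exported data, the lookup can also be scripted. Here is a rough Python sketch, assuming the page paths look like the example in the question. The PRODUCT_NAMES table and the rewrite_page_path function are hypothetical; in practice the table would be generated from the product catalogue so it stays in sync.

```python
import re

# Hypothetical lookup table; in practice load this from the product catalogue.
PRODUCT_NAMES = {"3": "shampoo", "7": "conditioner"}

PRODUCT_ID_PATTERN = re.compile(r"product_id=(\d+)")


def rewrite_page_path(path, names=PRODUCT_NAMES):
    """Replace product_id=<id> with product_name=<name>; leave unknown ids unchanged."""

    def substitute(match):
        name = names.get(match.group(1))
        return f"product_name={name}" if name else match.group(0)

    return PRODUCT_ID_PATTERN.sub(substitute, path)


rewritten = rewrite_page_path("www.example.com/abc/product_id=3")
```

Leaving unknown ids untouched means a product missing from the table still shows up in the report under its id instead of silently disappearing.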
I want to add functionality that pops up similar results for every search query the user enters.
I would be using it locally, so no Google, Haystack search, or anything like that.
Not sure I follow: to get anything meaningful, especially as someone new to the language, you're going to need to use an external search package. If you're uncomfortable setting up something like Elasticsearch locally, you can start with Whoosh which can be installed with pip. I would highly recommend using Haystack as it abstracts away what you use under the covers to make it friendlier to work with and allows you to swap out for something stronger than Whoosh in the future. Here's a list of back-ends: they all support the “More Like This” functionality. If you're insistent on not using Haystack, here's a previous answer about how to get started in Whoosh.
I'm just wondering what is exactly the functionality that haystack provides and if I need it.
I mean the search and indexing is done by whoosh. As far as I can tell, haystack is just offering ready made views, and forms. If I want to write my own form and views do I still need haystack?
Am I missing something?
P.S. I don't plan to use any other search engine than Whoosh, so I also don't need Haystack's multiple search engine wrapping.
Besides views, forms and a search engine-agnostic layer, the other powerful thing about Haystack is its ability to map Django models to something the search index understands. Using Haystack, you can easily specify which fields in a model should be indexed and how (see the SearchIndex API - http://django-haystack.readthedocs.org/en/latest/searchindex_api.html).
Once you have done that, you can then leverage the built-in management commands to (re)index your data when required.
It also comes with some nice templatetags to help present search results, like highlighting the matching bits.
Is there a particular reason that you don't want to use Haystack? It is a pretty non-intrusive plugin that lets you use as much of it as you need, and makes it easy to use more advanced functionality when you need it later down the road. In one of the sites I built, I only used the SearchIndex and SearchQuerySet APIs; I built my own views and forms. Ultimately, if you end up writing your own indexing and searching code, views and forms, you have basically re-written a large part of Haystack, in which case, you may want to consider using something that is in use out there and reasonably well tested.
That said, I have rolled my own 'Haystack' like layer in another project, mainly because the data source didn't map to the Django ORM. In that case, I wrote my own indexing scripts, and used PySolr to interface with my Apache Solr instance.
Given that Whoosh is written in Python, I'd assume it has a decent Python interface, so it shouldn't be too hard to do. I would only do it if there's something special about your scenario though.
I have an application that lists jobs within a certain location using spatial search. It is a fairly simple search with a few filters (date range, job type, etc) no large text to search. I was considering using something like Haystack with solr to do the search, is it worth the overhead or should I just query the database?
This sort of thing can be easily handled via Solr (or any of Haystack's other backends), but if you start with your database (see Django Filter for ideas to make this easy via URLs), and then shift to a search engine when the need arises (based on load), you'll thank yourself later for not introducing more complexity early on.
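To give a feel for how little is needed for the database-first approach, here is a minimal sketch using Python's built-in sqlite3. The jobs table, its columns, the sample rows and the find_jobs function are all made up for illustration, and the bounding box in degrees is only a crude stand-in for a proper spatial query in your actual database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE jobs (
           id INTEGER PRIMARY KEY,
           title TEXT, job_type TEXT, posted TEXT,
           lat REAL, lon REAL)"""
)
# Hypothetical sample data; `posted` holds ISO dates so string comparison sorts correctly.
conn.executemany(
    "INSERT INTO jobs (title, job_type, posted, lat, lon) VALUES (?, ?, ?, ?, ?)",
    [
        ("Barista", "part-time", "2024-03-01", 51.51, -0.12),
        ("Developer", "full-time", "2024-03-05", 51.52, -0.10),
        ("Developer", "full-time", "2024-01-10", 51.50, -0.11),
        ("Developer", "full-time", "2024-03-06", 53.48, -2.24),
    ],
)


def find_jobs(conn, lat, lon, radius_deg, job_type, start, end):
    """Return (id, title, posted) for jobs inside a lat/lon box that match the filters."""
    return conn.execute(
        """SELECT id, title, posted FROM jobs
           WHERE lat BETWEEN ? AND ?
             AND lon BETWEEN ? AND ?
             AND job_type = ?
             AND posted BETWEEN ? AND ?
           ORDER BY posted DESC""",
        (lat - radius_deg, lat + radius_deg, lon - radius_deg, lon + radius_deg,
         job_type, start, end),
    ).fetchall()


results = find_jobs(conn, 51.51, -0.11, 0.05, "full-time", "2024-03-01", "2024-03-31")
```

With indexes on the filtered columns, a query of this shape is usually cheap, which is why starting with the database is a reasonable default for a search without large text fields.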
When you do add the search engine, whichever you use, definitely use Haystack as the API, unless you go with Sphinx, in which case maybe see this blog post.
I have a request to return a list of the most popular search terms used when searching a Sitecore site.
I have no idea how to implement this sort of function using Sitecore, or whether Sitecore already has this kind of functionality. I can't find any documentation detailing this.
I am currently using search based on the LuceneSearch module (http://trac.sitecore.net/LuceneSearch) but altered to bind to a ListView for easy pagination.
At the moment I am probably just going to build a standalone function/class to update an XML file or something unless someone is able to point me in the correct direction...?
I would frankly use OMS for that - this is what it is designed to do. No need of separate database. Just register the search events via API with OMS. There is an out of the box Search report. May require some tweaking, but this seems to be the most out of the box solution.
Take a look here for more details.
I don't know of any standard functionality in Sitecore that would help you achieve this, so you will probably have to approach this from ground up - unless someone else in here is able to point to a package deal somewhere :-)
Solving this really breaks down into two tasks:
1) Collecting search term information. Whenever a user enters a search term in the searchbox that I assume you have, normalise it and store it in a SQL table (essentially a [term] [count] type table). Update the counter on terms you already store.
By normalising, I mean lowercasing it and so on - possibly breaking each search term (word) down and storing them one for one if that is what your solution calls for (probably not the route I would go)
2) Retrieving information from the table in real time, based on what the user is typing in the searchbox. I'm assuming you want some sort of "amazon-like" autocompletion, which is also found on almost all major search engines nowadays. I normally implement these in a web service that then gets called by Ajax, jQuery or whatever rich client implementation you prefer.
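The two tasks above can be sketched roughly as follows. This is an illustration in Python with an in-memory SQLite database, not Sitecore or .NET code, and the search_terms table and the normalise, record_search, top_terms and suggest functions are all hypothetical names.

```python
import sqlite3


def normalise(term):
    """Lowercase the term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def record_search(conn, term):
    """Insert the term if it is new, then increment its counter."""
    term = normalise(term)
    if not term:
        return
    conn.execute("INSERT OR IGNORE INTO search_terms (term, hits) VALUES (?, 0)", (term,))
    conn.execute("UPDATE search_terms SET hits = hits + 1 WHERE term = ?", (term,))


def top_terms(conn, limit=10):
    """Return the most popular terms, highest count first."""
    return conn.execute(
        "SELECT term, hits FROM search_terms ORDER BY hits DESC, term LIMIT ?", (limit,)
    ).fetchall()


def suggest(conn, prefix, limit=5):
    """Return stored terms starting with the typed prefix, most popular first.

    Note: '%' and '_' in the prefix act as LIKE wildcards in this sketch.
    """
    prefix = normalise(prefix)
    return [
        row[0]
        for row in conn.execute(
            "SELECT term FROM search_terms WHERE term LIKE ? || '%' "
            "ORDER BY hits DESC, term LIMIT ?",
            (prefix, limit),
        )
    ]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE search_terms (term TEXT PRIMARY KEY, hits INTEGER NOT NULL)")
for query in ["Pricing", "pricing ", "Contact us", "PRICING", "press"]:
    record_search(conn, query)
```

Here top_terms covers the "most popular search terms" list from the question, while suggest is the lookup a web service would run on each keystroke.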
As for updating an XML file, I think locking issues and performance would kill that solution; though it could perhaps be made to work on a very small scale.
Sorry that I can't be more specific in my response, but your question is very open-ended.
Very interesting question. One thing you could do is have another database to store these search queries. An insert into this DB would not be very difficult and would get around the issue of locking on an XML file. Maybe insert the search query into a DB table, then to get the top results just pull the top x rows ordered by the count for each query. As Mark Cassidy said before, maybe normalize the data before inserting it.
You could isolate this work on your search layout (or sublayout) so it runs on a specific part of the site, not on every page.
Sitecore has an out of the box "site search" report in the executive insight dashboard, this will give you an indication of what search terms are driving the most visits and of course engagement value.
You just need to configure it by registering a page event on the search page and passing the query, otherwise Sitecore wouldn't know what form field constitutes a search. See this post, it explains it in more detail. For more information you can download the analytics configuration reference document from SDN: http://sdn.sitecore.net/upload/sitecore6/65/engagement_analytics_configuration_reference_sc65-usletter.pdf
And don't forget that, for performance, Sitecore caches the reports at various levels, so during development it may be handy to know how to force a cache update. I talk about this in the following blog post:
http://andytsitecore.blogspot.co.uk/2013/10/sitecore-dms-and-analytics.html