Why would order of search terms affect solr query results?

If I search for Authorname "Title of Work" the records don't come up, but if I search for "Title of Work" Authorname then they do.
Why might this happen?
This is Solr running on ColdFusion. The only change is the order of the terms.
Update
Sample ColdFusion code. Note that in this example the first search gets 2 matches while the second gets 1. So the behaviour seems to depend on the actual string searched, but it still means that changing the order of terms changes the number of records returned.
I could understand the order of the returned records changing, since reordering the terms would change the relevance of the results. But all 3 records should show up for either one. I'll see if I can find the Solr logs and post them; that might help.
<cfset term1='"globalization of information"'>
<cfset term2='Reiter'>
<cfsearch name="ExampleOne" criteria='#term1# #term2#' collection="abstracts,fulltexts">
<cfoutput>#ExampleOne.recordcount#</cfoutput>
<cfsearch name="ExampleTwo" criteria='#term2# #term1#' collection="abstracts,fulltexts">
<cfoutput>#ExampleTwo.recordcount#</cfoutput>
<cfabort>
Output:
2 1

Try putting the search term in single quotes. I have tested it on CF 10 and it works fine for me.
So instead of:
<cfset term1='"globalization of information"'>
try this:
<cfset term1="'globalization of information'">

Related

How can I resolve INDEX MATCH errors caused by discrepancies in the spelling of names across multiple data sources?

I've set up a Google Sheets workbook that synthesizes data from a few different sources via manual input, IMPORTHTML and IMPORTRANGE. Once the data is populated, I'm using INDEX MATCH to filter and compare the information and to RANK each data set.
Since I have multiple data inputs, I'm running into a persistent issue of names not being written exactly the same between sources, even though they're the same person. First names are the primary culprit (i.e. Mary Lou vs Marylou vs Mary-Lou vs Mary Louise) but some last names with special symbols (umlauts, accents, tildes) are also causing errors. When Sheets can't recognize a match, the INDEX MATCH and RANK functions both break down.
I'm wondering how to better unify the data automatically so my Sheet understands that each occurrence is actually the same person (or "value").
Since you can't edit the results of an IMPORTHTML directly, I've set up "helper columns" and used functions like TRIM and SPLIT to try and fix instances as I go, but it seems like there must be a simpler path.
It feels like IFS could work but I can't figure how to integrate it. Also thinking this may require a script, which I'm just beginning to study.
Here's a simplified example of what I'm trying to achieve and the corresponding errors: Sample Spreadsheet
The first tab is attempting to pull and RANK data from tabs 2 and 3. Sample formulas from the Summary tab, row 3 (Amelia Rose):
Cell B3: =INDEX('Q1 Sales'!B:B, MATCH(A3,'Q1 Sales'!A:A,0))
Cell C3: =RANK(B3,$B$2:B,1)
Cell D3: =INDEX('Q2 Sales'!B:B, MATCH(A3,'Q2 Sales'!A:A,0))
Cell E3: =RANK(D3,$D$2:D,1)
I'd be grateful for any insight on how to best index 'Q2Sales'!B3 as the correct value for 'Summary'!D3. Thanks in advance - the thoughtful answers on Stack Overflow have gotten me this far!

To cover these variations, strip hyphens and whitespace and lowercase both sides before matching:
=ARRAYFORMULA(IFERROR(VLOOKUP(LOWER(REGEXREPLACE(A2:A, "-|\s", )),
{REGEXEXTRACT(LOWER(REGEXREPLACE('Q2 Sales'!A2:A, "-|\s", )),
TEXTJOIN("|", 1, LOWER(REGEXREPLACE(A2:A, "-|\s", )))), 'Q2 Sales'!B2:B}, 2, 0)))
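The same normalization idea can be sketched outside Sheets. This is a minimal Python illustration, not a translation of the formula above: the function name normalize_name is hypothetical, and the accent stripping step is an addition aimed at the umlaut/tilde problem mentioned in the question. It does not make "Mary Louise" equal to "Mary Lou"; the formula handles that case with REGEXEXTRACT.

```python
import re
import unicodedata


def normalize_name(name):
    """Return a comparison key: no accents, hyphens or whitespace, lowercase."""
    # Split accented letters into base letter + combining mark (NFKD),
    # then drop the combining marks so that "ü" becomes "u".
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Remove hyphens and all whitespace, as the REGEXREPLACE pattern "-|\s" does.
    return re.sub(r"[-\s]", "", without_marks).lower()


# Spelling variants of the same person collapse to one key.
variants = ["Mary Lou", "Marylou", "Mary-Lou"]
keys = {normalize_name(v) for v in variants}
```

Matching on the key rather than on the raw name is what lets the lookup succeed across sources.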

CloudSearch fuzzy matching of whole string doesn't work

I have set up an Amazon CloudSearch domain with records that hold addresses. I want to do a fuzzy text search on an address field.
Say I have a record with the following address:
1600 Amphitheatre Parkway, Mountain View, CA 94043.
If I search for 'Amphitheatre Parkway, Muntain View'~5 I get no results. I basically deleted the 'o' in "Mountain" and it doesn't find any results.
If I search for Muntain~5 it finds it, but again if I search for Miunntain~5 it doesn't find anything.
I should add I created a free text Analysis Scheme, with no stemming, stopwords or synonyms. This is what is used for the address field which is of type text.
How should I set up CloudSearch to be able to do this sort of query?

Querying 'Amphitheatre Parkway, Muntain View'~5 is actually performing a fuzzy/sloppy phrase search, where it's searching for those words within 5 words of one another. I don't think that's what you intended.
The Miunntain~5 query is really interesting: it does indeed return no results, but miunntain~5 (lowercase m) does.
I did notice that switching between lowercase and uppercase in my queries does slightly affect the match scores, so perhaps the capital M just makes it too weak a match. I don't have a good explanation for that; it's certainly counterintuitive, so maybe it is a bug.
Finally, your actual question about setting up CloudSearch to handle those queries: unfortunately CloudSearch doesn't expose the "Did you mean..." spellcheck feature from Solr, so there isn't really a good way to do this; slapping some tildes on things is about the best you can do.
See http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching-text.html
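As a rough sketch of "slapping some tildes on things": instead of quoting the whole address (which turns ~N into phrase slop), the query can be split into words and each word given its own fuzzy operator. The helper name fuzzy_query and the distance of 2 are assumptions for illustration; lowercasing follows the observation above that the lowercase term matched. How CloudSearch scores the resulting query is not something this snippet can confirm.

```python
import re


def fuzzy_query(text, distance):
    """Turn free text into per-term fuzzy clauses, e.g. 'muntain~2 view~2'."""
    # Keep only word characters so commas and other punctuation are dropped.
    terms = re.findall(r"\w+", text.lower())
    return " ".join(f"{term}~{distance}" for term in terms)


query = fuzzy_query("Amphitheatre Parkway, Muntain View", 2)
```

The resulting string would be passed as the search query without surrounding quotes, so each term is matched fuzzily on its own.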

How to handle searches for very common keywords

I want to be able to return useful records if a user searches for a keyword that is very, very common in a Solr index. For example, education.
In this case, close to 99% of the records would have that word in them. So searches for this word or similar take a long time.
This is for Solr on ColdFusion, but I'm open to solutions that involve Solr alone.
Right now I'm thinking of coming up with a list of stopwords and preventing those searches from taking place altogether.

If searches are taking a long time, it could be because you are not limiting the number of results that are returned. The <cfsearch> tag has a maxrows attribute, as well as a startrow attribute, that you could use to limit or paginate the data. Alternately, you could call Solr's web service directly through a <cfhttp> call:
<cfhttp url="http://localhost:8983/solr/<collection_name>/select/?q=<searchterm>&fl=*,score&rows=100&wt=json" />
Solr will return 10 rows by default; you can change this with the rows parameter. You can use the start parameter as well (note that Solr starts counting with 0 instead of 1). I believe this solution is more flexible, especially if you're using CF 9, as it allows you to paginate while sorting on a field other than score.
You can find more detail here:
http://www.thefaberfamily.org/search-smith/coldfusion-solr-tutorial/
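For illustration, here is a small Python sketch of how the rows and start parameters combine into a paginated request URL. The function name, the zero-based page argument, and the "abstracts" collection used in the example are assumptions; only the parameter names come from the URL shown above.

```python
from urllib.parse import urlencode


def solr_select_url(base, collection, term, page, page_size=100):
    """Build a select URL for one page of results (page is zero-based)."""
    params = {
        "q": term,
        "fl": "*,score",
        "rows": page_size,           # how many documents to return
        "start": page * page_size,   # Solr offsets start at 0
        "wt": "json",
    }
    return f"{base}/solr/{collection}/select/?{urlencode(params)}"


# Third page of 100 results, so the offset is 200.
url = solr_select_url("http://localhost:8983", "abstracts", "education", page=2)
```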

If the user searches on just one term that is exceedingly common then you need to limit your results and advise the user that there were too many matches.
In the more general case, you want to perform a two-pass (at least) approach. Take your search terms and perform a lookup to determine their 'common-ness'. You want to filter based on least common terms first, and more common terms last.
For example, user searches serendipitous education. You identify that you have 11 matches for serendipitous, and 900000 matches for education. Thus you apply the serendipitous filter first, resulting in 11 matches. Then apply the education filter, resulting in 7 matches.
The key to fast searching is indexing and precomputed statistics. If you have statistics like this on hand, you can dynamically create an optimised approach.
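The least-common-first idea can be sketched with a toy inverted index. This is a Python illustration under the assumption that each term maps to a set of matching record ids; it is not how Solr executes queries internally, and the function name search is hypothetical.

```python
def search(index, terms):
    """Intersect posting sets, starting with the rarest term."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return set()
    # The precomputed statistic here is simply the size of each posting set.
    postings.sort(key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
        if not result:
            break  # nothing left to filter, so stop early
    return result


index = {
    "serendipitous": {3, 50, 700},
    "education": set(range(1000)) - {50},
}
matches = search(index, ["education", "serendipitous"])
```

Starting from the three "serendipitous" records means the common term only has to be checked against three candidates.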

Getting values from CFLOOP

I am trying to pull out values from a CFLOOP and dump them, but I seem to be missing something. I need to extract openHours from the first loop and openMinutes from the second and put them in variables that will then be used in a query that saves the values to the database.
This is my struct when I dump out #form#. I need to get the variable form.openHours1. The problem is that openHours gets its number from #CountVar#, so basically I need to dump out something like #form.openHours[CountVar]#
struct
FIELDNAMES POSTITNOW,OPENHOURS1,OPENHOURS2,OPENHOURS3,OPENHOURS4,OPENHOURS5,OPENHOURS6,OPENHOURS7
OPENHOURS1 13
OPENHOURS2 13
OPENHOURS3 12
OPENHOURS4 0
OPENHOURS5 0
OPENHOURS6 0
OPENHOURS7 0
POSTITNOW YES

Rather than #form.openHours[CountVar]# what you want is:
form["openHours" & CountVar]
As a scope, FORM is also a struct and you can use array notation to get at the values.
This is key for working with dynamic form field names.
To clarify:
form.openHours7
is equivalent to
form["openHours7"]
The first is generally known as dot-notation, the second as array-notation (since it resembles how you refer to array elements).
Since the value in the bracket is a string, you can replace it with a variable.
<cfset fieldToUse = "openHours7">
<cfoutput>#form[fieldToUse]#</cfoutput>
Or, as I opened with, a combination of a literal string and a variable.
You can't really do that with dot-notation. (At least not without using evaluate(), which is generally not recommended.)
The documentation has lots of information on how to work with structures, including the different notation methods.

I think you want this, or something very similar:
<cfoutput>
<cfloop from="1" to="7" index="CountVar">
#form["openHours" & CountVar]#<br>
</cfloop>
</cfoutput>

Sorry, this is a little murky to me, but that's never stopped me from jumping in. Are you going to have an equal number of openhours and openminutes? Can you just loop over form.fieldnames? As it stands now, you have fields named openhours1-N, it sounds like openminutes1-N is yet to be added. It seems that you could loop over fieldnames, if the field starts with openhours you get the number off the end and then you can easily create the corresponding openminutes field. As Al said (much) earlier, you would then most likely use array-notation to get the values out of the form structure.
Another thought is that form field names don't have to be unique. If you had multiple occurrences of "openhours", ColdFusion would turn that into a list for you, then you could just loop over that list.
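The loop-over-fieldnames suggestion can be sketched as follows. This is a Python illustration rather than ColdFusion, so a dict with lowercase keys stands in for the FORM scope (Python dicts are case-sensitive, unlike CF structs). The openminutes fields and the default of "0" are hypothetical, since the question's dump does not include them yet.

```python
def pair_hours_minutes(form):
    """Map each numeric suffix to its (hours, minutes) pair."""
    pairs = {}
    for name in form["fieldnames"].split(","):
        lowered = name.lower()
        if lowered.startswith("openhours"):
            # Take the number off the end and build the matching field name.
            suffix = lowered[len("openhours"):]
            minutes = form.get("openminutes" + suffix, "0")
            pairs[int(suffix)] = (form[lowered], minutes)
    return pairs


form = {
    "fieldnames": "POSTITNOW,OPENHOURS1,OPENHOURS2",
    "postitnow": "YES",
    "openhours1": "13",
    "openhours2": "12",
    "openminutes1": "30",
}
pairs = pair_hours_minutes(form)
```

The lookup by a key built at run time is the same idea as form["openHours" & CountVar] in array-notation.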

Simple query working for years, then suddenly very slow

I've had a query that has been running fine for about 2 years. The database table has about 50 million rows, and is growing slowly. This last week one of my queries went from returning almost instantly to taking hours to run.
Rank.objects.filter(site=Site.objects.get(profile__client=client, profile__is_active=False)).latest('id')
I have narrowed the slow query down to the Rank model. It seems to have something to do with using the latest() method. If I just ask for a queryset, it returns an empty queryset right away.
#count returns 0 and is fast
Rank.objects.filter(site=Site.objects.get(profile__client=client, profile__is_active=False)).count() == 0
Rank.objects.filter(site=Site.objects.get(profile__client=client, profile__is_active=False)) == [] #also very fast
Here are the results of running EXPLAIN. http://explain.depesz.com/s/wPh
And EXPLAIN ANALYZE: http://explain.depesz.com/s/ggi
I tried vacuuming the table, no change. There is already an index on the "site" field (ForeignKey).
Strangely, if I run this same query for another client that already has Rank objects associated with her account, then the query returns very quickly once again. So it seems that this is only a problem when there are no Rank objects for that client.
Any ideas?
Versions:
Postgres 9.1,
Django 1.4 svn trunk rev 17047

Well, you've not shown the actual SQL, so that makes it difficult to be sure. But, the explain output suggests it thinks the quickest way to find a match is by scanning an index on "id" backwards until it finds the client in question.
Since you said it has been fast until recently, this is probably not a silly choice. However, there is always the chance that a particular client's record will be right at the far end of this search.
So - try two things first:
Run an analyze on the table in question, see if that gives the planner enough info.
If not, increase the stats (ALTER TABLE ... SET STATISTICS) on the columns in question and re-analyze. See if that does it.
http://www.postgresql.org/docs/9.1/static/planner-stats.html
If that's still not helping, then consider an index on (client,id), and drop the index on id (if not needed elsewhere). That should give you lightning fast answers.
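A toy model may help show why the backward scan on id is fast when a match exists and slow when none does, and why a composite index avoids the problem. This is plain Python, not PostgreSQL; the composite key is written as (site, id) on the assumption that the filtered column is the site foreign key, and both function names are hypothetical.

```python
from bisect import bisect_left


def latest_by_backward_scan(rows, site):
    """Walk (id, site) rows from highest id down; return (id, rows examined)."""
    examined = 0
    for row_id, row_site in sorted(rows, reverse=True):
        examined += 1
        if row_site == site:
            return row_id, examined
    return None, examined  # no match: every row was examined


def latest_by_composite_index(index, site):
    """index is a sorted list of (site, id); find the highest id for site."""
    # Position of the first entry belonging to any later site.
    pos = bisect_left(index, (site + 1,))
    if pos > 0 and index[pos - 1][0] == site:
        return index[pos - 1][1]
    return None


rows = [(i, i % 3) for i in range(1, 1001)]      # sites 0, 1 and 2 only
index = sorted((site, row_id) for row_id, site in rows)

hit = latest_by_backward_scan(rows, 1)    # found almost immediately
miss = latest_by_backward_scan(rows, 5)   # site 5 has no rows at all
```

In this model the scan for a site with no rows touches all 1000 entries, while the composite lookup is a single binary search either way.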

latest() is normally used for date comparisons; maybe you should try ordering by id descending and then limiting to one result.
