I want to find "Radiohead" but not "Radiohead's" with Sunspot/Solr - regex

I'm using Solr via the Sunspot gem in a Rails project.
I am indexing scraped data.
My indexing is currently done like so:
searchable do
  text :title, :boost => 3.0 do
    title.gsub(/\'s\b/, "")
  end
  text :mentions do
    mentions.map do |mention|
      mention.title.gsub(/\'s\b/, "")
    end
  end
end
Currently, if I do:
Video.solr_search { fulltext '"Radiohead"' }
Solr will return results with:
Radiohead's
and
Radiohead
I would like to only find:
Radiohead
Is there a way to do this via Sunspot?

Check what filters you have defined in the analyzer section of the field type for your field in schema.xml (in .../solr/conf directory). Here's an example:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    ...
    <filter class="solr.SnowballPorterFilterFactory" language="English" />
  </analyzer>
</fieldType>
The behaviour you're seeing is called "stemming": the indexed value is the stem of the word rather than the word itself. E.g., "fly", "flies" and "flying" would all be indexed under one common stem (an algorithmic stemmer typically does not handle irregular forms such as "flew"). If the field type has a stemming filter such as Snowball, you'll get the behaviour you're seeing. Try removing the filter, restarting Solr, and then reindexing your documents.
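To make the effect concrete, here is a minimal sketch of why the two forms end up matching. It is a toy analyzer and a toy inverted index, not Solr's implementation; `toy_analyze` and `build_index` are made-up names, and the only normalization modelled is lower-casing plus removal of a trailing possessive. Note that the gsub in the question's searchable block has the same effect at the application level: "Radiohead's" is already turned into "Radiohead" before Solr sees it.

```ruby
# Toy illustration only: this is NOT Solr's analyzer, just a stand-in that
# lower-cases tokens and strips a trailing possessive, the way a stemming or
# possessive filter might at index time.
def toy_analyze(text)
  text.downcase.scan(/[a-z0-9']+/).map { |token| token.sub(/'s\z/, "") }
end

# A tiny inverted index: analyzed term => list of document ids.
def build_index(docs)
  index = Hash.new { |hash, term| hash[term] = [] }
  docs.each do |id, text|
    toy_analyze(text).uniq.each { |term| index[term] << id }
  end
  index
end

# Made-up sample documents.
docs = {
  1 => "Radiohead live in Paris",
  2 => "Radiohead's best songs"
}

index = build_index(docs)
matches = index["radiohead"]
puts matches.inspect
```

Both documents are filed under the same term, so a query for "Radiohead" cannot tell them apart. Keeping the forms distinct means not normalizing them to the same term at index time.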

You should do a phrase query (using double quotes):
Video.solr_search { fulltext '"Radiohead"' }
Or modify your Solr schema.xml so that "Radiohead's" is not split into separate tokens. I don't know your field configuration, so I can't provide more details.

Related

Getting rid of plaintext hyperlinks before indexing a record in Solr

I have a field whose content is used to generate facets. One particular problem I'd like to solve is that some of my content contains hyperlinks in plain text, e.g. http://google.com. As a result, I started seeing http as one of my top facets. How can I make sure that the hyperlink content is filtered out before I index it? By using a regex filter of some sort?
I know that I can do this pre-processing part on the client side, when I add the records to Solr. Yet, I'd like to keep everything consistent, and part of the Solr pipeline, so I'd like the Solr pre-processor to do this for me if possible.
I would solve it with these components:
The solr.UAX29URLEmailTokenizer preserves the URL as a token
The solr.PatternReplaceFilterFactory replaces the URL token with an empty string (search Stack Overflow for a suitable regex pattern)
A solr.LengthFilterFactory filters the zero-length token
In schema.xml:
<analyzer type="index">
  <tokenizer class="solr.UAX29URLEmailTokenizerFactory" />
  <filter class="solr.PatternReplaceFilterFactory" pattern="..." replacement="" />
  <filter class="solr.LengthFilterFactory" min="1" max="1000" />
</analyzer>
Note that changing the tokenizer from the solr.StandardTokenizerFactory may have implications beyond what is described in this answer, so be sure to test.
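The three-step chain can be sketched outside Solr to show how the steps interact. This is only a model under stated assumptions: `tokenize`, `blank_urls`, `drop_empty` and `facet_terms` are hypothetical helpers, whitespace splitting stands in for the URL-aware tokenizer, and `URL_PATTERN` is a deliberately simple placeholder, not the pattern elided in the schema above.

```ruby
# Hypothetical stand-ins for the three Solr components; the URL pattern is a
# deliberately simple assumption, not the pattern elided in the schema above.
URL_PATTERN = %r{\Ahttps?://\S+\z}

# 1. Tokenize on whitespace so that a URL survives as a single token
#    (the role played by the URL-aware tokenizer).
def tokenize(text)
  text.split
end

# 2. Replace any token that is entirely a URL with an empty string
#    (the role of the pattern-replace filter).
def blank_urls(tokens)
  tokens.map { |token| token.sub(URL_PATTERN, "") }
end

# 3. Drop zero-length tokens (the role of the length filter with min=1).
def drop_empty(tokens)
  tokens.reject(&:empty?)
end

def facet_terms(text)
  drop_empty(blank_urls(tokenize(text)))
end

terms = facet_terms("search engines like http://google.com are popular")
puts terms.inspect
```

The key point is the ordering: the URL has to reach the replace step as one token. If the tokenizer splits it first, "http" becomes its own term and the pattern never sees the whole URL.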

Sitecore 7 Content search .ToLower() method not supported, looking for alternatives

I am working on Sitecore 7.2 content search. I want to compare the document title with a string, but in lowercase. When I use the .ToLower() method in the search clause, I get an error saying that the method is not supported. The exact error is:
8648 11:19:34 ERROR Unsupported string method: ToLowerInvariant.
8648 11:19:34 ERROR at Sitecore.ContentSearch.Linq.Parsing.ExpressionParser.VisitStringMethod(MethodCallExpression methodCall)
Is there any way to do case insensitive string comparison?
You don't need to apply ToLower(): by default, search on text fields is case-insensitive.
Make sure the Lucene analyzer type is
Sitecore.ContentSearch.LuceneProvider.Analyzers.LowerCaseKeywordAnalyzer
A sample index field configuration is
<field fieldName="subject" storageType="YES" indexType="TOKENIZED" vectorType="NO" boost="0.3f" type="System.String" settingType="Sitecore.ContentSearch.LuceneProvider.LuceneSearchFieldConfiguration, Sitecore.ContentSearch.LuceneProvider">
<analyzer type="Sitecore.ContentSearch.LuceneProvider.Analyzers.LowerCaseKeywordAnalyzer, Sitecore.ContentSearch.LuceneProvider" />
</field>
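The idea behind a lower-casing keyword analyzer can be sketched in a few lines. This is a toy model, not Sitecore or Lucene code: `keyword_term` and `title_match?` are hypothetical names, and the titles are made-up sample data. The value is kept as one term and lower-cased both when it is indexed and when it is queried, so the comparison itself never needs a ToLower() call.

```ruby
# Toy stand-in for a lower-casing keyword analyzer: the whole value is kept
# as a single term and lower-cased. Names are illustrative, not Sitecore APIs.
def keyword_term(value)
  value.downcase
end

# Index time: store the normalized form of each title (sample data).
indexed_titles = ["Annual Report", "Press RELEASE"].map { |title| keyword_term(title) }

# Query time: normalize the query the same way, then compare exactly.
def title_match?(indexed_titles, query)
  indexed_titles.include?(keyword_term(query))
end

found = title_match?(indexed_titles, "press release")
puts found
```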

Set Ant property based on a regular expression in a file

I have the following in a file
version: [0,1,0]
and I would like to set an Ant property to the string value 0.1.0.
The regular expression is
version:[[:space:]]\[([[:digit:]]),([[:digit:]]),([[:digit:]])\]
and I need to then set the property to
\1.\2.\3
to get
0.1.0
I can't work out how to use the Ant tasks together to do this.
I have Ant-contrib so can use those tasks.
Based on matt's second solution, this worked for me for any (text) file, one line or not. It has no ant-contrib dependencies.
<loadfile property="version" srcfile="version.txt">
  <filterchain>
    <linecontainsregexp>
      <regexp pattern="version:[ \t]\[([0-9]),([0-9]),([0-9])\]"/>
    </linecontainsregexp>
    <replaceregex pattern="version:[ \t]\[([0-9]),([0-9]),([0-9])\]" replace="\1.\2.\3" />
  </filterchain>
</loadfile>
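For anyone who wants to check the expression outside Ant first, here is a rough sketch of what the filter chain does to a file's text. `extract_version` is a hypothetical helper and the sample file content is made up. Unlike Ant, the sketch only looks at the first matching line and strips surrounding whitespace for convenience.

```ruby
# Same expression as in the filterchain above; [ \t] matches exactly one
# space or tab after the colon.
VERSION_LINE = /version:[ \t]\[([0-9]),([0-9]),([0-9])\]/

# Find the first line that matches (a simplification of linecontainsregexp,
# which keeps every matching line), then rewrite the match as dotted digits
# (replaceregex). Returns nil when no line matches.
def extract_version(file_text)
  line = file_text.each_line.find { |l| l =~ VERSION_LINE }
  return nil unless line
  line.sub(VERSION_LINE, '\1.\2.\3').strip
end

sample = "name: demo\nversion: [0,1,0]\nlicense: MIT\n"
version = extract_version(sample)
puts version
```

One thing the sketch makes easy to see: each `[0-9]` accepts a single digit, so a version such as [0,10,0] would not match. Using `[0-9]+` in each group would accept multi-digit components.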
Solved it with this:
<loadfile property="burning-boots-js-lib-build.lib-version" srcfile="burning-boots.js"/>
<propertyregex property="burning-boots-js-lib-build.lib-version"
    override="true"
    input="${burning-boots-js-lib-build.lib-version}"
    regexp="version:[ \t]\[([0-9]),([0-9]),([0-9])\]"
    select="\1.\2.\3" />
But it seems a little wasteful - it loads the whole file into a property!
If anyone has any better suggestions please post :)
Here's a way that doesn't use ant-contrib, using loadproperties and a filterchain (note that replaceregex is a "string filter" - see the tokenfilter docs - and not the replaceregexp task):
<loadproperties srcFile="version.txt">
  <filterchain>
    <replaceregex pattern="\[([0-9]),([0-9]),([0-9])\]" replace="\1.\2.\3" />
  </filterchain>
</loadproperties>
Note the regex is a bit different because we're treating the file as a properties file.
Alternatively you could use loadfile with a filterchain, for instance if the file you wanted to load from wasn't in properties format.
For example, if the file contents were just [0,1,0] and you wanted to set the version property to 0.1.0, you could do something like:
<loadfile srcFile="version.txt" property="version">
  <filterchain>
    <replaceregex pattern="\s*\[([0-9]),([0-9]),([0-9])\]" replace="\1.\2.\3" />
  </filterchain>
</loadfile>
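The loadproperties variant relies on ordering: the filter rewrites the text first, and only then is the result parsed as properties. A rough model of that, assuming a much simplified properties parser (`load_properties` is a hypothetical helper, not an Ant API):

```ruby
# Rough model of loadproperties with a filterchain: each line is filtered
# first and only then parsed as a "key: value" or "key=value" entry.
# The parsing below is a simplification of the Java properties format.
BRACKETED = /\[([0-9]),([0-9]),([0-9])\]/

def load_properties(text)
  text.each_line.each_with_object({}) do |line, props|
    filtered = line.sub(BRACKETED, '\1.\2.\3')
    key, value = filtered.strip.split(/\s*[:=]\s*/, 2)
    props[key] = value if key && value
  end
end

props = load_properties("version: [0,1,0]\n")
puts props["version"]
```

Because the key comes from the file itself, the property ends up being named version without any extra configuration. That is why the pattern here only needs to cover the bracketed part.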

Ant pass specific string values from one file to another

I'm trying to configure auto-increment of the version number in an Android application.
I've configured Ant tasks which check out the file, auto-increment the string holding the version value, and then automatically check it in to TFS. The test.properties file has this content:
versionCode="1"
This task does the auto-increment:
<propertyfile file="${basedir}/src/test.properties">
  <entry key="versionCode" value="1" type="int" operation="+" />
</propertyfile>
How would I configure replacement of the value of this string: android:versionCode="1", located in the target file androidmanifest.xml?
I suppose you can use the ReplaceRegExp task in Ant:
Directory-based task for replacing the occurrence of a given regular expression with a substitution pattern in a file or set of files.
It might look something like this:
<replaceregexp file="${src}/androidmanifest.xml"
    match="android:versionCode=&quot;[0-9]+&quot;"
    replace="android:versionCode=&quot;${versionCode}&quot;"
/>
The versionCode property could be obtained by reading in the property file before running this task.
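The match and replace expressions can be tried on a sample string before wiring them into the build. A small sketch, with a made-up manifest line and a hypothetical `bump_manifest` helper. One detail worth noting: inside an XML attribute the double quotes of the pattern have to be written as &quot; (or the attribute has to be delimited with single quotes), whereas in the sketch they appear literally.

```ruby
# Pattern mirrors the task above; the manifest snippet and the new value are
# made-up sample data.
VERSION_CODE = /android:versionCode="[0-9]+"/

# Replace the first versionCode attribute with the given value. The block
# form of sub avoids any special handling of backslashes in the replacement.
def bump_manifest(manifest_xml, version_code)
  manifest_xml.sub(VERSION_CODE) { %(android:versionCode="#{version_code}") }
end

manifest = %(<manifest package="com.example.app" android:versionCode="1" android:versionName="1.0">)
updated = bump_manifest(manifest, 2)
puts updated
```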
A solution with vanilla Ant, no add-on needed: use a property file template,
e.g. one named props.txt, that contains:
...
android:versionCode=#versionCode#
...
Later on, use the ${versionCode} property set by your workflow
and create the property file from your template with copy and a nested filterset:
<!-- somewhere set in your workflow maybe via userproperty -DversionCode=42 -->
<property name="versionCode" value="42"/>
...
<copy file="props.txt" todir="/some/path" overwrite="true">
  <filterset begintoken="#" endtoken="#">
    <filter token="versionCode" value="${versionCode}"/>
  </filterset>
</copy>
/some/path/props.txt will then contain:
...
android:versionCode=42
...
See the Ant manual on filtersets.
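The token substitution itself is simple enough to model in a few lines. Keep in mind that Ant's filterset uses @ as its default token delimiter, so a template written with # only works if the filterset's begintoken and endtoken are set to #. The sketch below takes the delimiter as a parameter and uses # to match the template above; `apply_filterset` is a hypothetical helper, not an Ant API.

```ruby
# Minimal model of a filterset: every occurrence of
# <delimiter>name<delimiter> is replaced by the configured value.
def apply_filterset(template, tokens, delimiter: "#")
  tokens.reduce(template) do |text, (name, value)|
    text.gsub("#{delimiter}#{name}#{delimiter}") { value.to_s }
  end
end

template = "android:versionCode=#versionCode#\n"
rendered = apply_filterset(template, { "versionCode" => 42 })
puts rendered
```

With the wrong delimiter the template is copied unchanged, which is an easy mistake to miss because the copy task still succeeds.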

Solr - equivalent to LIKE in SQL

I've been playing around with creating a collection in Apache Solr via ColdFusion 9 from a database result set. I would like to do a search which would be as follows in SQL:
select * from events where eventName like 'Meet%'
In SQL this will match partially on a word and return the row. I am trying to do this using a Solr collection and <cfsearch> in CF like so:
<cfsearch collection="#myCollection#" criteria="Meet*" name="results" />
However I am not getting the data back unless I specify the full word, despite the use of the wildcard. The docs say the wildcard is not allowed at the start of the search but it doesn't say it's not allowed at the end. In fact for me it doesn't work anywhere!
<!--- No results -->
<cfsearch collection="#myCollection#" criteria="Meet*" name="results" />
<!--- No results -->
<cfsearch collection="#myCollection#" criteria="Meet*g" name="results" />
<!--- No results -->
<cfsearch collection="#myCollection#" criteria="Meeti?g" name="results" />
<!--- Yes - results! -->
<cfsearch collection="#myCollection#" criteria="Meeting" name="results" />
Has anyone implemented a wildcard Solr search using <cfsearch>? If so can you point me in the right direction on this please?
Try "meet*" rather than "Meet*". I've found that wildcards will only work with lower case strings, so whenever a search query contains an asterisk I LCase() the string before passing it to Solr.
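A likely explanation (an assumption about how the collection is configured, not something confirmed here) is that the indexed terms are lower-cased by the field's analyzer while a wildcard term is matched as typed, so "Meet*" never lines up with the indexed "meeting". The LCase() workaround translates to something like this sketch, where `normalize_criteria` is a hypothetical helper:

```ruby
# Hypothetical helper: lower-case a search string only when it contains a
# wildcard (* or ?), mirroring the LCase() workaround described above.
def normalize_criteria(criteria)
  criteria.match?(/[*?]/) ? criteria.downcase : criteria
end

puts normalize_criteria("Meet*")
puts normalize_criteria("Meeting")
```

Queries without a wildcard are passed through untouched, since those are analyzed in the usual way.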
Have you had a look at this post about wildcard searching in Solr? As long as you are using the correct query parser, i.e. one that supports wildcard queries, then you should be able to do the 'Meet%' query using 'Meet*'.