Solr - equivalent to LIKE in SQL - coldfusion

I've been playing around with creating a collection in Apache Solr via ColdFusion 9 from a database result set. I would like to do a search which would be as follows in SQL:
select * from events where eventName like 'Meet%'
In SQL this will match partially on a word and return the row. I am trying to do this using a Solr collection and <cfsearch> in CF like so:
<cfsearch collection="#myCollection#" criteria="Meet*" name="results" />
However, I am not getting any data back unless I specify the full word, despite the use of the wildcard. The docs say a wildcard is not allowed at the start of the search term, but they don't say it is not allowed at the end. In fact, for me it doesn't work anywhere!
<!--- No results --->
<cfsearch collection="#myCollection#" criteria="Meet*" name="results" />
<!--- No results --->
<cfsearch collection="#myCollection#" criteria="Meet*g" name="results" />
<!--- No results --->
<cfsearch collection="#myCollection#" criteria="Meeti?g" name="results" />
<!--- Yes - results! --->
<cfsearch collection="#myCollection#" criteria="Meeting" name="results" />
Has anyone implemented a wildcard Solr search using <cfsearch>? If so can you point me in the right direction on this please?

Try "meet*" rather than "Meet*". I've found that wildcards will only work with lower case strings, so whenever a search query contains an asterisk I LCase() the string before passing it to Solr.
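The lower-casing step described above could be sketched roughly like this. This is a Python illustration of the logic only, not ColdFusion, and the helper name normalize_wildcard_query is made up for the example; in CFML the same idea would be applying LCase() to the criteria whenever it contains a wildcard.

```python
def normalize_wildcard_query(criteria: str) -> str:
    """Lower-case the search string only when it contains a wildcard (* or ?).

    Hypothetical helper: queries without wildcards are passed through unchanged.
    """
    if "*" in criteria or "?" in criteria:
        return criteria.lower()
    return criteria


print(normalize_wildcard_query("Meet*"))    # meet*
print(normalize_wildcard_query("Meeti?g"))  # meeti?g
print(normalize_wildcard_query("Meeting"))  # Meeting
```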

Have you had a look at this post about wildcard searching in Solr? As long as you are using the correct query parser, i.e. one that supports wildcard queries, then you should be able to do the 'Meet%' query using 'Meet*'.

Related

remove some code from the url pagination

I am using ColdFusion and I am not very good with regular expressions. I want to remove part of the following URL before every next-page call:
http://beta.mysite.com/?jobpage=2page=2#brands
If jobpage exists, jobpage=2 should be removed from the URL. The 2 is dynamic: it can be 1, 2, 3 and so on.
I tried ListFirst, ListLast and GetToken, but none of them helped.
This should do it for you
<cfset myurl = "http://beta.mysite.com/?jobpage=2page=2##brands" />
<cfoutput>#myurl#</cfoutput><br><br>
<cfset myurl = ReReplaceNoCase(myurl,"(jobpage=[0-9]+[\&]?)","","ALL") />
<cfoutput>#myurl#</cfoutput>
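To see what that pattern does, here is a rough Python equivalent of the ReReplaceNoCase call. It is only an illustration of the regex, not ColdFusion code, and the function name strip_jobpage is made up for the example. The pattern matches "jobpage=", one or more digits, and an optional trailing "&".

```python
import re


def strip_jobpage(url: str) -> str:
    """Remove every 'jobpage=<digits>' parameter, plus a trailing '&' if present."""
    # Case-insensitive, like ReReplaceNoCase; re.sub replaces all matches.
    return re.sub(r"jobpage=[0-9]+&?", "", url, flags=re.IGNORECASE)


print(strip_jobpage("http://beta.mysite.com/?jobpage=2page=2#brands"))
# http://beta.mysite.com/?page=2#brands
print(strip_jobpage("http://beta.mysite.com/?jobpage=13&page=2#brands"))
# http://beta.mysite.com/?page=2#brands
```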

Getting rid of plaintext hyperlinks before indexing a record in Solr

I have a field whose content is used to generate facets. One particular problem I'd like to solve is that some of my content contains hyperlinks in plaintext, e.g. http://google.com. As a result, I started seeing http as one of my top facets. How can I make sure that I filter out the hyperlink content before I index it? Using a regex filter of some sort?
I know that I can do this pre-processing part on the client side, when I add the records to Solr. Yet, I'd like to keep everything consistent, and part of the Solr pipeline, so I'd like the Solr pre-processor to do this for me if possible.
I would solve it with these components:
The solr.UAX29URLEmailTokenizerFactory preserves the URL as a single token
The solr.PatternReplaceFilterFactory replaces the URL token with an empty string (search Stack Overflow for a suitable regex pattern)
A solr.LengthFilterFactory filters out the resulting zero-length token
In schema.xml:
<analyzer type="index">
  <tokenizer class="solr.UAX29URLEmailTokenizerFactory" />
  <filter class="solr.PatternReplaceFilterFactory" pattern="..." replacement="" />
  <filter class="solr.LengthFilterFactory" min="1" max="1000" />
</analyzer>
Note that changing the tokenizer from the solr.StandardTokenizerFactory may have implications beyond what is described in this answer, so be sure to test.

How do I replace text in all href attributes of anchor tags?

I need to replace the text inside all href values. I think a regular expression is the way to do it, but I'm no regex pro. Any thoughts on how I'd do the following using ColdFusion?
so it is changed to:
Thanks!
Here's an update to the question: I have this code and need the pattern below:
<cfset matches = ReMatch('<a[^>]*href="http[^"]*"[^>]*>(.+?)</a>', arguments.htmlCode) />
<cfdump var="#matches#">
<cfset links = arrayNew(1)>
<cfloop index="a" array="#matches#">
<cfset arrayAppend(links, rereplace(a, 'need regex'," {clickurl}","all"))>
</cfloop>
<cfdump var="#links#">
Here's how to do it with the jsoup HTML parser:
<cfset jsoup = createObject('java','org.jsoup.Jsoup') />
<cfset Dom = jsoup.parse( InputHtml ) />
<cfset Dom.select('a[href]').attr('href','{replaced}') />
<cfset NewHtml = Dom.html() />
(On CF9 and earlier, this requires placing the jsoup jar in CF's lib directory, or using JavaLoader.)
Using an HTML parser is usually better than using regex, not least because it's easier to maintain and understand.
Here's an imperfect way of doing it with a regex:
<cfset NewHtml = InputHtml.replaceAll
( '(?<=<a.{0,99}?\shref\s{0,99}?=\s{0,99}?)(?:"[^"]+|''[^'']+)(["'])'
, '$1{replaced}$1'
)/>
Which hopefully demonstrates why using a tool such as jsoup is definitely the way to go...
(btw, the above is using the Java regex engine (via string.replaceAll), so it can use the lookbehind functionality, which doesn't exist in CF's built-in regex (rereplace/rematch/etc))
Update, based on the new code sample you've provided...
Here is an example of how to use jsoup for what you're doing - it might still need some updates (depending on what {clickurl} is eventually going to be doing), but it currently functions the same as your sample code is attempting:
<cfset jsoup = createObject('java','org.jsoup.Jsoup') />
<cfset links = jsoup.parse( Arguments.HtmlCode )
<!--- select all links beginning http and change their href --->
.select('a[href^=http]').attr('href',' {clickurl}')
<!--- get HTML for all links, then split into array. --->
.outerHtml().split('(?<=</a>)(?!$)')
/>
<cfdump var="#links#" />
That middle bit is all a single cfset, but I split it up and added comments for clarity. (You could of course do this with multiple variables and 3+ cfsets if you preferred that.)
Again, it's not a regex, because what you're doing involves parsing HTML, and regex is not designed for parsing tag-based syntax, so isn't very good at it - there are too many quirks and variations with HTML and describing them in a single regex gets very complicated very quickly.

I want to find "Radiohead" but not "Radiohead's" with Sunspot/Solr

I'm using solr via the sunspot gem in a rails project.
I am indexing scraped data.
My indexing is currently done like so:
searchable do
  text :title, :boost => 3.0 do
    title.gsub(/\'s\b/, "")
  end
  text :mentions do
    mentions.map do |mention|
      mention.title.gsub(/\'s\b/, "")
    end
  end
end
Currently, if I do:
Video.solr_search { fulltext '"Radiohead"' }
Solr will return results with:
Radiohead's
and
Radiohead
I would like to only find:
Radiohead
Is there a way to do this via Sunspot?
Check what filters you have defined in the analyzer section of the field type for your field in schema.xml (in .../solr/conf directory). Here's an example:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    ...
    <filter class="solr.SnowballPorterFilterFactory" language="English" />
  </analyzer>
</fieldType>
The behaviour you're seeing is called "stemming": the indexed value is the stem of the word, rather than the word itself. E.g. "fly", "flies" and "flying" would all be indexed under the same stem. If there's a stemming filter such as Snowball in the analyzer, then you'll get the behaviour you're seeing. Try removing the filter, restarting Solr, then reindexing your documents.
You should do a phrase query (using double quotes):
Video.solr_search { fulltext '"Radiohead"' }
Or modify your Solr schema.xml so that "Radiohead's" is not split. I don't know your field configuration here so I can't provide more details...

White Space / Coldfusion

What would be the correct way to stop the white space that ColdFusion outputs?
I know there is cfcontent and cfsetting enableCFoutputOnly. What is the correct way to do that?
In addition to <cfsilent>, <cfsetting enablecfoutputonly="yes"> and <cfprocessingdirective suppressWhiteSpace = "true">, there is <cfcontent reset="true" />. You can use it to discard whitespace at the beginning of your document.
An HTML5 document would then start like this:
<cfcontent type="text/html; charset=utf-8" reset="true" /><!doctype html>
An XML document:
<cfcontent reset="yes" type="text/xml; charset=utf-8" /><CFOUTPUT>#VariableHoldingXmlDocAsString#</CFOUTPUT>
This way you won't get the "Content is not allowed in prolog"-error for XML docs.
If you are getting unwanted whitespace from a function, use the output attribute to suppress any output and return your result as a string, for example:
<cffunction name="getMyName" access="public" returntype="string" output="no">
<cfreturn "Seybsen" />
</cffunction>
You can modify the ColdFusion output by getting access to the ColdFusion output buffer. James Brown recently demo'd this at our user group meeting (Central Florida Web Developers User Group).
<cfscript>
out = getPageContext().getOut().getString();
newOutput = REreplace(out, 'regex', '', 'all');
</cfscript>
A great place to do this would be in Application.cfc onRequestEnd(). Your result could be a single line of HTML which is then sent to the browser. Work with your web server to GZip and you'll cut bandwidth a great deal.
In terms of tags, there is cfsilent.
In the administrator there is a setting to 'Enable whitespace management'.
Further reading on cfsilent and cfcontent reset.
If neither <cfsilent> nor <cfsetting enablecfoutputonly="yes"> can satisfy you, then you are probably over-engineering this issue.
If you are asking solely for aesthetic reasons, my recommendation is: ignore the whitespace, it does not do any harm.
Alternatively, you can ensure your entire page is stored within a variable and all this processing is done within cfsilent tags, e.g.
<cfsilent>
<!-- some coldfusion -->
<cfsavecontent variable="pageContent">
<html>
<!-- some content -->
</html>
</cfsavecontent>
<!-- reformat pageContent if required -->
</cfsilent><cfoutput>#pageContent#</cfoutput>
Finally, you can perform any additional processing after you've generated the pagecontent but before you output it e.g. a regular expression to remove additional whitespace or some code tidying.
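As an example of that kind of post-processing, the sketch below trims each line of the generated page and drops empty lines. It is written in Python purely to illustrate the idea (in ColdFusion you would apply something similar to pageContent before the final cfoutput), and the function name collapse_whitespace is made up. Note that it would also alter whitespace-sensitive content such as <pre> blocks.

```python
def collapse_whitespace(page_content: str) -> str:
    """Strip leading/trailing whitespace from every line and drop blank lines."""
    stripped = (line.strip() for line in page_content.splitlines())
    return "\n".join(line for line in stripped if line)


page = "\n\n<html>\n\n   <body>  \n\t<p>Hi</p>\n   </body>\n\n</html>\n"
print(collapse_whitespace(page))
```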
Here's a tip if you use CFC.
If you're not expecting your method to generate any output, use output="false" in <cffunction> and <cfcomponent> (this is unnecessary only if you're using CF9 script style). This will eliminate a lot of unwanted whitespace.
If you have access to the server and want to implement it on every page request, search for and install trimflt.jar. It's a Java servlet filter that will remove all whitespace and line breaks before sending the response. Drop the jar in the /WEB-INF/lib dir of CF and edit the web.xml file to add the filter. It's also configurable to remove comments, exclude files or extensions, and preserve specific strings. I've been running it for a few years without a problem. A set-it-and-forget-it solution.
I've found that even after using every possible way to eliminate whitespace, your output may still have some unwanted spaces or line breaks. If you're still experiencing this, you may need to sacrifice well-formatted code for the desired output.
For example, instead of:
<cfprocessingdirective suppressWhiteSpace = "true">
<cfquery ...>
...
...
...
</cfquery>
<cfoutput>
Welcome to the site #query.userName#
</cfoutput>
</cfprocessingdirective>
You may need to code:
<cfprocessingdirective suppressWhiteSpace = "true"><cfquery ...>
...
...
...
</cfquery><cfoutput>Welcome to the site #query.userName#</cfoutput></cfprocessingdirective>
This isn't CF adding whitespace, but you adding whitespace when formatting your CF.
HTH