I am relatively new to Solr so please forgive me if I'm missing something obvious. I have an application that allows users to search for musical artists. The indexing comes from a read-only database with correct spellings so on the index side I have it figured out.
On the query side, however, I need to anticipate various spelling errors/differences and want to help Solr find those instances. From our old home-grown search solution, I have a list of regexes and the artists they apply to. When I was trying to translate those to Solr using the PatternReplaceCharFilterFactory, I noticed that some worked perfectly, while others didn't work at all ... with seemingly no rhyme or reason between them.
For example:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="em[ei]n[ei]m" replacement="Eminem"/>
accurately captures the common misspellings of Eminem. But for the band 311:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[Tt]hree [Ee]leven" replacement="311"/>
Does not work. Another example is Nine Inch Nails:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="((nine|9).*inch.*nails\b)|(n\.? ?i\.? ?n\.?\b)" replacement="Nine Inch Nails"/>
works perfectly for finding the most common patterns for the band's name. But for Eve 6:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[Ee]ve.{0,4}([Ss]ix|6)" replacement="Eve 6"/>
Does not work either. Is there something fundamental I'm missing about the usage of this filter? I've tried a number of variations on the regexes I've mentioned above (even going so far as using literals like 'three eleven'), but still with no success. I've tried making the filter in question the only PatternReplaceCharFilterFactory in the analyzer. I also know for sure that these items are in the index correctly, because when I search for the correct spelling it returns the proper results.
Any suggestions?
Snowdall
I suspect the problem is not with your char filter, but with what comes after it, specifically the tokenizer. If you use the standard tokenizer, it will get rid of the numbers you have just put into your stream. If you don't need the text to be split into tokens, you could look at the KeywordTokenizerFactory instead.
In general, the best way to troubleshoot this in Solr 4+ is the Analysis screen in the Admin WebUI. It allows you to enter your text against a particular field type and see what happens to it after each component in the analysis chain.
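To make that concrete, here is a minimal query-side sketch; the field type name is made up and the exact filters depend on your schema, so treat it as a starting point rather than a drop-in definition:

    <fieldType name="artist_search" class="solr.TextField">
      <analyzer type="query">
        <!-- hypothetical field type: keep the whole query as one token so "Three Eleven" survives as "311" -->
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[Tt]hree [Ee]leven" replacement="311"/>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Running "three eleven" through the Analysis screen against a field type like this shows the text before and after each component, which makes it obvious where a replacement is being lost.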
I would recommend using the SynonymFilter for the kind of application you describe. It allows you to provide an external file where you list words and their synonyms, like:
eminem, emenem
nine, 9
If you precede this with a LowerCaseFilter, you won't have to fuss about case normalization in your synonyms. You should be able to handle the 311 case too, as long as you don't tokenize (i.e. use a KeywordTokenizer, as Alexander Rafalovitch suggested).
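As a rough sketch (the field type name and synonyms file name are placeholders), the query analyzer could look something like this; the tokenizerFactory attribute on the synonym filter is there so multi-word entries like "three eleven, 311" line up with the single keyword token:

    <fieldType name="artist_synonyms" class="solr.TextField">
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- synonyms.txt holds lines such as: eminem, emenem  and  three eleven, 311 -->
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"
                tokenizerFactory="solr.KeywordTokenizerFactory"/>
      </analyzer>
    </fieldType>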
Related
I am working on a file system project (like Dropbox). For the file system, I have indexed data for full-text search in Elasticsearch. I have lots of large documents and searching works really well. But now my requirement is to use this data to query for some regexes. We have an admin panel for the customer, and the regexes will be defined dynamically by the customer in the admin panel.
I know I can do regex searches in Elasticsearch, but the problem here is the tokenizer. For instance, let's assume a user wants to create a regex pattern that searches for 3 letters, '-' and 2 digits, such as "ABC-12" or "ASD-34". The problem here is my tokenizer. The defined tokenizer omits the character '-', and indexes "ABC" and "12" separately. You might say: then don't omit the '-' character. But a user may also want to search a pattern with 3 letters, white space and 2 digits to retrieve data like "ABC 12". Here the white space is the problem. Somehow I have to use a tokenizer, and I cannot cover all dynamic regexes. So searching in the index does not solve my problem.
Actually, for this type of search I have another option, which is to query all data with match_all. With the search scroll API, I can retrieve all original documents in batches. After each response from the scroll API, I can run my regex finder in a separate thread, so that I can prepare the desired data as the scrolling operation proceeds. Do you think this option is good for big data? I think I will need a lot of CPU power and RAM. I know it is not an ideal solution, but I cannot find any effective solution for my requirement. I am open to better solutions. Thanks.
I believe ES allows you to analyse the same field multiple times. The documentation states that new analysers can be added to existing fields later:
New multi-fields can be added to existing fields using the PUT mapping API.
This opens up the possibility to dynamically add new analysers (and tokenisers, for that matter) as you find out what sort of regexes your users are after. I am not sure how trivial it will be for your particular use case, but this seems like an avenue worth exploring.
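As a rough illustration (the index, field and sub-field names are made up, and the exact request shape depends on your ES version), adding a multi-field analysed a second way could look like:

    PUT /documents/_mapping
    {
      "properties": {
        "content": {
          "type": "text",
          "fields": {
            "exact": {
              "type": "text",
              "analyzer": "whitespace"
            }
          }
        }
      }
    }

A regex such as [A-Z]{3}-[0-9]{2} could then be run against content.exact, where the '-' is still part of the token, while everything else keeps using the original field.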
I've added an image in order to explain myself better.
I have 300-something ports in an Expression transformation. I have created the equivalent number of groups in a Union transformation. I want each port of the Expression to go to a port/field of the Union, a one-to-one relationship. It seems like PowerCenter is not able to do this with autolink, or at least I'm unable to find the proper way to do it. How could I work around this issue? I've been told that in a few days it will likely be more than 700 ports, and the amount of time it takes to do this by hand is quite insane. Thanks in advance.
I'm surprised it validates... a Union is for homogeneous sources, but you seem to be trying to pivot your data (in which case I'd suggest using another transformation, i.e. a Normalizer, and Informatica will start behaving as expected).
Possible solution: make a bunch of connections, save and export the file as XML, go to the lines where the connections are defined, and replace that section with as many rows as you need.
What I did specifically was to take the original rows, change the names as appropriate with the help of Notepad++ and Excel, and then go back to the original file and replace all of them. Check everything three times, and import the file back into PowerCenter.
I say possible solution because it's messy and dirty, but even though it may lead to mistakes, I feel the number of them is vastly smaller, and you have versioning on your side, so just save before exporting. If someone with more experience could share their thoughts about this, it would be a great opportunity to learn; I'm just leaving this here in case the question goes unanswered.
I'm going to be running through live Twitter data and attempting to pull out tweets that mention, for example, movie titles. Assuming I have a list of ~7000 hard-coded movie titles I'd like to match against, what's the best way to select the relevant tweets? This project is in its infancy, so I'm open to looking into any solution (i.e. language-agnostic). Any help would be greatly appreciated.
Update: I'd be curious if anyone has any insight into how the Yahoo! Placemaker API solves this problem. It can take a text string and return a geocoded JSON result of all the locations mentioned in it.
You could try Wu and Manber's A Fast Algorithm For Multi-Pattern Searching.
The multi-pattern matching problem lies at the heart of virus scanning, so you might look to scanner implementations for inspiration. ClamAV, for example, is open source and some papers have been published describing its algorithms:
Lin, Lin and Lai: A Hybrid Algorithm of Backward Hashing and Automaton Tracking for Virus Scanning (a variant of Wu-Manber; the paper is behind the IEEE paywall).
Cha, Moraru, et al: SplitScreen: Enabling Efficient, Distributed Malware Detection
If you use compiled regular expressions, it should be pretty fast, especially if you put many of the titles into one expression as alternations.
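For instance, a sketch of the single-expression approach in Python (the title list here is obviously a stand-in for your ~7000):

    import re

    titles = ["The Matrix", "Inception", "Up"]  # stand-in for the real list of ~7000 titles

    # Longest titles first so "The Matrix Reloaded" would win over "The Matrix";
    # re.escape treats punctuation in titles literally, \b keeps "Up" from matching inside "supper"
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in sorted(titles, key=len, reverse=True)) + r")\b",
        re.IGNORECASE,
    )

    def titles_in(tweet):
        return pattern.findall(tweet)

    print(titles_in("Inception is better than Up"))  # ['Inception', 'Up']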
Efficiently searching for many terms in a long character sequence would require a specialized algorithm to avoid testing for every term at every position.
But since it sounds like you have short strings with a known pattern, you should be able to use something fairly simple. Store the set of titles you care about in a hash table or tree. Parse out "string1" and "string2" from each tweet using a regex, and test whether they are contained in the set.
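A minimal sketch of that idea in Python, assuming the tweets follow an "X is better than Y" shape as in the question:

    import re

    movie_titles = {"inception", "the matrix", "up"}  # stand-in for the ~7000 titles, lowercased

    compare = re.compile(r"^(.+?)\s+is better than\s+(.+?)[.!?]*$", re.IGNORECASE)

    def extract_movies(tweet):
        m = compare.match(tweet.strip())
        if not m:
            return None
        first = m.group(1).strip().lower()
        second = m.group(2).strip().lower()
        # Keep the tweet only if both sides are titles we care about
        if first in movie_titles and second in movie_titles:
            return first, second
        return None

    print(extract_movies("Inception is better than The Matrix!"))  # ('inception', 'the matrix')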
Working off what erickson suggested, the most feasible approach is to search for the connecting phrase ("is better than" in your example) and then check the results against the 7,000 terms. You could instead narrow the set by creating 7,000 searches for "[movie] is better than" and then filtering manually on the second movie, but you'll probably hit the search rate limit pretty quickly.
You could speed up the searching by using a dedicated search service like Solr instead of using text parsing. You might be able to pull out titles quickly using some natural language processing service (OpenCalais?), but that would be better suited to batch processing.
For simultaneously searching for a large number of possible targets, the Rabin-Karp algorithm can often be useful.
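A toy illustration in Python of how multi-pattern Rabin-Karp works (patterns are grouped by length so one rolling-hash pass covers each length; this is for exposition, not tuned for 7000 titles over a live stream):

    from collections import defaultdict

    def rabin_karp_multi(text, patterns, base=256, mod=1_000_003):
        # Group patterns by length and hash each one; one rolling-hash pass per distinct length
        by_len = defaultdict(dict)
        for p in patterns:
            h = 0
            for ch in p:
                h = (h * base + ord(ch)) % mod
            by_len[len(p)].setdefault(h, []).append(p)

        hits = []
        for length, table in by_len.items():
            if length == 0 or length > len(text):
                continue
            high = pow(base, length - 1, mod)
            h = 0
            for ch in text[:length]:
                h = (h * base + ord(ch)) % mod
            for i in range(len(text) - length + 1):
                if h in table:
                    window = text[i:i + length]
                    for p in table[h]:
                        if window == p:  # verify, since different strings can share a hash
                            hits.append((i, p))
                if i + length < len(text):
                    # Roll the hash: drop text[i], append text[i + length]
                    h = ((h - ord(text[i]) * high) * base + ord(text[i + length])) % mod
        return hits

    print(rabin_karp_multi("i think inception is better than up", ["inception", "up", "the matrix"]))
    # -> [(8, 'inception'), (33, 'up')]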
This has been my nightmare for the last 4 weeks:
I can't come up with a solution for a "related posts" app in Django/Python that takes the user's input and comes up with a related post that closely matches the original input. I've tried using LIKE statements, but it seems that they are not sensitive enough.
I also need typos to be taken into consideration.
Is there a library that could save me from all my pain and suffering?
Well, I suppose there are a few different ways to normalize the user input to produce desirable results (although I'm not sure to what extent libraries exist for them). One of the easiest ways to get related posts would be to compare the tags present on the posts (granted your posts have tags). If you wanted to go another route, I would take the following steps: remove stop words from the subject, run some kind of stemmer on the remainder, and finally treat the remaining words as "tags" to compare with other posts. For the sake of efficiency, it would probably be a good idea to run these steps as a batch process on all of your current posts and store off the resulting "tags". As for typos, I'm sure a multitude of spelling-corrector libraries exist (I found this one after a few seconds with Google).
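A rough sketch of that pipeline in Python; the stop word list and the "stemmer" below are crude stand-ins (a real app would likely reach for something like NLTK's PorterStemmer and a proper stop word list):

    import re

    STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "for", "on", "how"}  # tiny stand-in list

    def crude_stem(word):
        # Very naive stemmer stand-in, just to illustrate the step
        for suffix in ("ing", "ers", "er", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def to_tags(subject):
        words = re.findall(r"[a-z0-9']+", subject.lower())
        return {crude_stem(w) for w in words if w not in STOP_WORDS}

    def related_score(subject_a, subject_b):
        # Simple overlap score between the derived "tags" of two posts
        a, b = to_tags(subject_a), to_tags(subject_b)
        return len(a & b) / max(len(a | b), 1)

    print(related_score("How to install Django on Windows", "Installing Django under Windows 7"))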
I have inherited a large legacy ColdFusion app. There are hundreds of <cfquery>some sql here #variable#</cfquery> statements that need to be parameterized along the lines of: <cfquery> some sql here <cfqueryparam value="#variable#"/> </cfquery>
How can I go about adding parameterization programmatically?
I have thought about writing some regular expression or sed/awk-style solution, but it seems like somebody somewhere must have tackled this problem already. Bonus points awarded for inferring the SQL type automatically.
There's a queryparam scanner that will find them for you on RIAForge: http://qpscanner.riaforge.org/
There is a script referenced here: http://www.webapper.net/index.cfm/2008/7/22/ColdFusion-SQL-Injection that will do the majority of the heavy lifting for you. All you have to do is check the queries and make sure the syntax will parse properly.
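For the scanning side of this, a naive sketch in Python might look like the following; it only flags candidates, does not rewrite anything, and will miss or over-report edge cases such as escaped ## or hash signs inside strings:

    import os
    import re

    CFQUERY_BLOCK = re.compile(r"<cfquery\b.*?>(.*?)</cfquery>", re.IGNORECASE | re.DOTALL)
    RAW_VARIABLE = re.compile(r"#[^#]+#")

    def flag_unparameterized(root):
        # Report cfquery bodies that interpolate #variables# outside of <cfqueryparam>
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                if not name.lower().endswith((".cfm", ".cfc")):
                    continue
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="replace") as f:
                    source = f.read()
                for match in CFQUERY_BLOCK.finditer(source):
                    # Drop cfqueryparam tags, then see if any raw #variable# remains
                    body = re.sub(r"<cfqueryparam\b[^>]*>", "", match.group(1), flags=re.IGNORECASE)
                    if RAW_VARIABLE.search(body):
                        line = source.count("\n", 0, match.start()) + 1
                        print(f"{path}:{line}: query interpolates a raw #variable#")

    flag_unparameterized(".")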
There is no excuse for not using CFQueryParam: apart from being much more secure, it is a performance boost and the best way to handle quoted values in character-based column types.
Keep in mind that you may not be able to solve everything with <cfqueryparam>.
I've seen a number of examples where the ORDER BY field name is being passed in the query string, which is a slightly trickier problem to solve, as you need to validate it in a more "manual" way.
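One common way to do that "manual" validation is a whitelist check before the query; a rough CFML sketch (the column, datasource and query names here are made up):

    <!--- Only let known column names through; anything else falls back to a default --->
    <cfparam name="url.sortCol" default="artist_name">
    <cfset allowedSortColumns = "artist_name,release_date,rating">
    <cfif NOT listFindNoCase(allowedSortColumns, url.sortCol)>
        <cfset url.sortCol = "artist_name">
    </cfif>
    <cfquery name="getArtists" datasource="myDSN">
        SELECT artist_name, release_date, rating
        FROM artists
        ORDER BY #url.sortCol#
    </cfquery>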
<cf_inputFilter
scopes = "FORM,COOKIE,URL"
chars = "<,>,!,&,|,%,=,(,),',{,}"
tags="script,embed,applet,object,HTML">
We used this to counteract a recent SQL injection attack. We added it to the Application.cfm file for our site.
I doubt that there is a solution that will fit your needs exactly. The only option I see is to write your own recursive search that builds a report for you or use one of the apps/scripts that people have listed above. Basically, you are going to have to edit each page or approve all of the automated changes.