lucene, search for sentences that contain one term twice - regex

I have pairs of search strings, and I want to use Lucene to search for sentences that contain all terms that are contained in these strings. So for example if I have the two search strings "white shark" and "fish", I want all sentences containing all of "white", "shark" and "fish". Apparently, with Lucene this can be done rather easily by means of a boolean query; this is how I do it in my code:
String search = str1+" "+ str2;
BooleanQuery booleanQuery = new BooleanQuery();
QueryParser queryParser = new QueryParser(...);
queryParser.setDefaultOperator(QueryParser.Operator.AND);
booleanQuery.add(queryParser.parse(search), BooleanClause.Occur.MUST);
However, I also have pairs of search strings where one string is a subpart of the other, e.g. "timber wolf" and "wolf", and in these cases I would like to get only sentences that contain "wolf" at least twice (and "timber" at least once). Is there any way to achieve this with Lucene? Many thanks in advance for your answers!

Keep in mind that a doc that has both "timber wolf" and a separate "wolf" will get scored higher, all else being equal, since the term "wolf" occurs twice, giving it a higher tf score. In most cases like this, having the desired results come first is acceptable, usually even desirable.
That said, I believe you could get what you want using a phrase query with slop, and having the slop set adequately high. Something like:
"timber wolf wolf"~10000
Which is probably high enough for most cases. This would require two instances of wolf and one of timber.
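As a minimal sketch of how the query text above could be assembled from a pair of search strings (the helper name sloppy_phrase is hypothetical, and it assumes the terms contain no characters that need escaping for the query parser):

```python
def sloppy_phrase(first: str, second: str, slop: int = 10000) -> str:
    """Join both search strings into one quoted phrase with a slop suffix."""
    terms = first.split() + second.split()
    # Repeated terms are kept on purpose: "wolf" appearing twice in the
    # phrase is what asks for two occurrences in the sentence.
    return '"{}"~{}'.format(" ".join(terms), slop)


print(sloppy_phrase("timber wolf", "wolf"))  # prints "timber wolf wolf"~10000
```

The resulting string would then be handed to queryParser.parse(...) in place of the plain concatenation.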
However, if you need timber wolf to appear (that is, those two terms adjacent and in order), you'll need to abandon the query parser, and construct the appropriate queries yourself. SpanQueries, specifically.
SpanQuery wolfQuery = new SpanTermQuery(new Term("myField", "wolf"));
SpanQuery[] timberWolfSubQueries = {
new SpanTermQuery(new Term("myField", "timber")),
new SpanTermQuery(new Term("myField", "wolf"))
};
//arguments "0, true" mean 0 slop and in order (respectively)
SpanQuery timberWolfQuery = new SpanNearQuery(timberWolfSubQueries, 0, true);
SpanQuery[] finalSubQueries = {
wolfQuery, timberWolfQuery
};
//arguments "10000, false" mean 10000 slop and not (necessarily) in order
SpanQuery finalQuery = new SpanNearQuery(finalSubQueries, 10000, false);
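To sanity-check results from either query, the counting requirement itself can be modelled outside Lucene. The following is only a sketch with a hypothetical helper, contains_all_terms, and a naive tokenizer (lowercased runs of word characters) rather than a Lucene analyzer; it checks term counts only, not adjacency or order.

```python
import re
from collections import Counter


def contains_all_terms(sentence: str, *search_strings: str) -> bool:
    """True if the sentence holds every term at least as often as the
    search strings mention it in total."""
    needed = Counter()
    for s in search_strings:
        needed.update(re.findall(r"\w+", s.lower()))
    found = Counter(re.findall(r"\w+", sentence.lower()))
    return all(found[term] >= count for term, count in needed.items())


print(contains_all_terms("A timber wolf met another wolf.", "timber wolf", "wolf"))
print(contains_all_terms("A timber wolf howled.", "timber wolf", "wolf"))
```

The first call satisfies the requirement (two occurrences of "wolf", one of "timber"); the second does not, because "wolf" appears only once.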

Related

Mixed word and geo search

How can I make a search query that
Finds everything around a location (using aroundLatLng and aroundRadius)
It's biased by a word search
Doesn't require results to include words (e.g. they're used for relevance, but if no words match a record it is not excluded a priori)
?
So for example if I query for "great view" near {lat: 48.8, lng: 2.3}, it should rank by word-relevance first, then by distance, and records that don't match any words but are near Paris should still come up in the results.
To make sure all words are not mandatory, you can use the optionalWords feature to consider all query words optional (beware, at least 1 will always need to match):
var query = "great view";
index.search(query, {
optionalWords: query,
aroundLatLng: '48.8,2.3',
aroundRadius: 10000
}).then(....);
You can then optionally move the geo criterion of the default ranking formula down, just before the custom one.
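To make the ordering the question asks for concrete, here is a small self-contained model. It is not how the search engine ranks internally; the record layout, the rank function and the great-circle distance helper are all assumptions made for illustration. Records outside the radius are dropped, the rest are sorted by number of matched words first and by distance second.

```python
import math


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two lat/lng points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def rank(records, query, center, radius_m):
    """Names of records inside the radius: most matched words first, then nearest."""
    words = query.lower().split()
    scored = []
    for rec in records:
        dist = haversine_m(center[0], center[1], rec["lat"], rec["lng"])
        if dist > radius_m:
            continue  # the geo filter is mandatory
        tokens = rec["text"].lower().split()
        matched = sum(1 for w in words if w in tokens)  # words are optional
        scored.append((-matched, dist, rec["name"]))
    return [name for _, _, name in sorted(scored)]


records = [
    {"name": "a", "text": "great view of the river", "lat": 48.85, "lng": 2.35},
    {"name": "b", "text": "quiet cafe", "lat": 48.81, "lng": 2.31},
    {"name": "c", "text": "nice view", "lat": 48.82, "lng": 2.30},
    {"name": "d", "text": "great view", "lat": 43.30, "lng": 5.40},
]
print(rank(records, "great view", (48.8, 2.3), 10000))  # ['a', 'c', 'b']
```

Record "b" matches no words but still appears because it is nearby, while "d" matches both words but is excluded by the radius.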

Identifying nearly identical messages in list

It looks like a simple task, but how would you solve it? I don't get any solution right now.
ls_message-text = 'Pernr. 12345678 (Pete Peterson) is valid (06/2015)'.
append ls_message to lt_message.
ls_message-text = 'Pernr. 12345678 (Pete Peterson) is valid (07/2015)'.
append ls_message to lt_message.
This is the code I got, the thing is, this is the message I am showing in my application. The customer says that the 2 messages are the same. The second should be deleted.
How would you compare it to delete the line? The table might contain more than 2 lines and also lines with another text like "is not valid".
I can't extend the structure to have more fields for comparison, I can only use the string comparison on this one field. Are there string comparisons possible with a regex or something?
Maybe you could solve your requirement using the Levenshtein distance. ABAP has a built-in function "distance" that gives you the number of operations needed to convert one string into another. For example:
DATA msg1 type string.
DATA msg2 type string.
msg1 = 'Levenshtein Distance 7/2015'.
msg2 = 'Levenshtein Distance 6/2015'.
data l_distance type i.
l_distance = distance( val1 = msg1 val2 = msg2 ).
if l_distance lt 2 .
"It's almost the same text
endif.
In this case l_distance will be 1, because only one operation is necessary (replacing).
Hope this helps.
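For readers without an ABAP system at hand, the same measure can be sketched in a few lines. This is the textbook dynamic-programming version of Levenshtein distance, written here as an illustration and not a reproduction of how ABAP's distance function is implemented.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or keep
        prev = cur
    return prev[-1]


m1 = "Pernr. 12345678 (Pete Peterson) is valid (06/2015)"
m2 = "Pernr. 12345678 (Pete Peterson) is valid (07/2015)"
m3 = "Pernr. 12345679 (Pete Peterson) is valid (06/2015)"
print(levenshtein(m1, m2))  # 1
print(levenshtein(m1, m3))  # 1
```

Note that the second pair also has distance 1 even though the personnel numbers differ, so a bare distance threshold can merge messages that belong to different people.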
Assuming you want to retain only one message for each unique Pernr. in lt_message, you can use regex to filter for the Pernr. and use that as "key". Now you can delete all but the first message of lt_message that matches this key.
Expand your regex if you want to keep only certain messages, e.g. only the "is valid" ones.
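A minimal sketch of that key-based approach, assuming every message starts with "Pernr." followed by the number; the pattern and the helper keep_first_per_pernr are illustrative, not taken from the question's system:

```python
import re

# Captures the digits that follow "Pernr." and uses them as the key.
PERNR_RE = re.compile(r"Pernr\.\s*(\d+)")


def keep_first_per_pernr(messages):
    """Keep only the first message seen for each personnel number."""
    seen = set()
    kept = []
    for text in messages:
        match = PERNR_RE.search(text)
        # Messages without a recognisable number are keyed by their full text.
        key = match.group(1) if match else text
        if key in seen:
            continue
        seen.add(key)
        kept.append(text)
    return kept


messages = [
    "Pernr. 12345678 (Pete Peterson) is valid (06/2015)",
    "Pernr. 12345678 (Pete Peterson) is valid (07/2015)",
    "Pernr. 87654321 (Ann Other) is not valid (06/2015)",
]
print(keep_first_per_pernr(messages))
```

To treat "is valid" and "is not valid" as different messages for the same person, the key could be extended with a second capture group for the status text.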
Have you tried looking at the program DEMO_REGEX_TOY?
It gives an idea of how to work with regular expressions, which will probably solve the problem.

How to find a substring anywhere in a string

This should be easy, but I'm finding it difficult.
I just want to find whether a substring exists anywhere in a string. In my case, whether the name of a website exists in the title of a product.
My code is like this:
#FindNoCase("Amazon.com", "Google Chromecast available at Amazon")#
The above returns a 0 which is correct because the entire substring "Amazon.com" doesn't exist in the main string. But some of it does, namely the "Amazon" part.
How could I achieve what I'm trying to do, which is just to see if ANY part of the substring (at least more than 2 characters in length) exists in the main string?
So I need something like FindOneOf() but actually "find at least three of". It should then look at the word "Amazon" in the product title and check if at least 3 characters in the sequence of "Amazon.com" exist. When it sees that "Ama" exists, then it just needs to return a true value. Can it be done using the existing built-in functions somehow?
Update: Very simple solution. I used Left("amazon", 3).
There's a lot of danger in false positives, like if someone was buying the Alabama state flag.
Because of store names that contain spaces, this is a little tricky (Wal Mart is often written with a space).
If your string always contains at [store], you can extract the store name by finding the last at in the sentence and creating a string by chopping off everything else.
Because it looks for occurrences of at only as a whole word, there's no danger with store names such as Beats Audio, or Sam's Meat Shop. I can't think of any stores with the word at in the name. While that would technically trip it up, there's much lower risk, and you can do a pre-replace on such store names.
<cfset mystring = "Google Chromecast available at Amazon">
<cfset SellerName = REReplaceNoCase(mystring,".*\b(?:at)\b(?!.*\b(?:at)\b)\s*","")>
<cfoutput>Seller: #Sellername#</cfoutput>
You can then do your comparisons much more safely.
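The same pattern can be tried outside ColdFusion. The sketch below ports it to another regex engine purely for experimentation; the function name seller_after_last_at is an assumption, and the pattern is otherwise the one shown above.

```python
import re

# Everything up to and including the last whole-word "at" (plus trailing spaces).
LAST_AT_RE = re.compile(r".*\bat\b(?!.*\bat\b)\s*", re.IGNORECASE)


def seller_after_last_at(title: str) -> str:
    """Strip everything through the last whole-word "at"."""
    return LAST_AT_RE.sub("", title, count=1)


print(seller_after_last_at("Google Chromecast available at Amazon"))  # Amazon
print(seller_after_last_at("Hat rack available at Beats Audio"))      # Beats Audio
```

If the title contains no whole-word "at", the pattern does not match and the title comes back unchanged, so the caller has to check for that case.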
Per your comment, if you know all possible patterns, you can still obtain the data if you want to (false positives can either be embarrassing or catastrophic, depending on the action). If you know the stores you're working with, you can use a regex to pull out the string like this:
<cfset mystring = "Google Chromecast available at Amazon.co.uk">
<cfset SellerName = REReplaceNoCase(mystring,".*\b((Google|Amazon|Wal[\W]*Mart|E[\W]*bay)(\.[a-z]+)*)\b","\1")>
<cfoutput>Seller: #Sellername#</cfoutput>
The only part you need to update is the pipe-delimited list. You might add K-Mart as K[\W]*Mart; the [\W]* permits any run of special characters or spaces, so it covers kMart, K-Mart, and k*Mart, but not Kwik-E-Mart.
Update #2, per more comments
<cfset mystring = "Google Chromecast available at Toys-R-US">
<cfset SellerNameRE = REReplace(rsProduct.sellername,"[\W]+","[\W]*","ALL")>
<cfset TheSellerName = REReplaceNoCase(mystring,".*\b((#sellernameRE#)(\.[a-z]+)*)\b","\1")>
<cfoutput>Seller: #TheSellername# (#SellerNameRE#)</cfoutput>
This replaces every run of symbols in the seller name with the [\W]* pattern, which makes the symbols optional, so a title that says Wal*Mart or WalMart will still match the seller name Wal-Mart.
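As a rough sketch of the same idea in another regex engine (the helpers seller_regex and find_seller are hypothetical, and this version searches for the first match and returns None on failure instead of doing a replace):

```python
import re


def seller_regex(name: str) -> str:
    """Turn a seller name into a pattern where symbols and spaces are optional."""
    parts = [re.escape(p) for p in re.split(r"\W+", name) if p]
    return r"[\W]*".join(parts)


def find_seller(title: str, name: str):
    """Return the seller as written in the title (with any domain suffix), or None."""
    pattern = re.compile(r"\b(" + seller_regex(name) + r")(\.[a-z]+)*\b", re.IGNORECASE)
    match = pattern.search(title)
    return match.group(0) if match else None


print(seller_regex("Toys-R-US"))                             # Toys[\W]*R[\W]*US
print(find_seller("Chromecast available at Toys R Us", "Toys-R-US"))
print(find_seller("Sold by WalMart.com", "Wal-Mart"))
print(find_seller("Sold at Kwik-E-Mart", "K-Mart"))
```

Returning None for a non-match avoids the pitfall of a replace-based approach, where an unmatched title is handed back unchanged and can be mistaken for a seller name.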
You could also load a separate column with "Regex Names" so that you're not doing this each time.
So your table would look something like
SellerID  SellerName  RegexName
1         Wal-Mart    Wal[\W]*Mart
2         Toys-R-US   Toys[\W]*R[\W]*US
<cfset mystring = "Google Chromecast available at Toys-R-US">
<cfset TheSellerName = REReplaceNoCase(mystring,".*\b((#rsProduct.RegexName#)(\.[a-z]+)*)\b","\1")>
<cfoutput>Seller: #TheSellername# (#rsProduct.RegexName#)</cfoutput>
Solved it by doing this
#FindNoCase(left("Amazon.com", 3), "Google Chromecast available at Amazon")#
Yes, there is potential it won't do what I need in cases where the seller name is less than 3 characters long, but I think that's rare enough to be OK.
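For comparison, the prefix check can be written out as a small function, which also makes the false-positive risk mentioned earlier easy to reproduce. This is a sketch with a hypothetical name, title_mentions; unlike FindNoCase it returns a boolean rather than a position, and it deliberately rejects seller names shorter than the prefix length.

```python
def title_mentions(seller: str, title: str, prefix_len: int = 3) -> bool:
    """Case-insensitive check that the first prefix_len characters of the
    seller name occur anywhere in the title."""
    prefix = seller[:prefix_len].lower()
    # Names shorter than prefix_len are rejected instead of matched loosely.
    return len(prefix) == prefix_len and prefix in title.lower()


print(title_mentions("Amazon.com", "Google Chromecast available at Amazon"))  # True
print(title_mentions("Amazon.com", "Alabama state flag"))                      # True
```

The second result is the false positive: "ama" occurs inside "Alabama", so a three-character prefix alone cannot tell the two apart.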

Best way to compare phone numbers using Regex

I have two databases that store phone numbers. The first one stores them with a country code in the format 15555555555 (a US number), and the other can store them in many different formats (ex. (555) 555-5555, 5555555555, 555-555-5555, 555-5555, etc.). When a phone number unsubscribes in one database, I need to unsubscribe all references to it in the other database.
What is the best way to find all instances of phone numbers in the second database that match the number in the first database? I'm using the entity framework. My code right now looks like this:
using (FusionEntities db = new FusionEntities())
{
var communications = db.Communications.Where(x => x.ValueType == 105);
foreach (var com in communications)
{
string sRegexCompare = Regex.Replace(com.Value, "[^0-9]", "");
if (sMobileNumber.Contains(sRegexCompare) && sRegexCompare.Length > 6)
{
var contact = db.Contacts.Where(x => x.ContactID == com.ContactID).FirstOrDefault();
contact.SMSOptOutDate = DateTime.Now;
}
}
}
Right now, my comparison checks to see if the first database contains at least 7 digits from the second database after all non-numeric characters are removed.
Ideally, I want to be able to apply the regex formatting to the point in the code where I get the data from the database. Initially I tried this, but I can't use replace in a LINQ query:
var communications = db.Communications.Where(x => x.ValueType == 105 && sMobileNumber.Contains(Regex.Replace(x.Value, "[^0-9]", "")));
Comparing phone numbers is a bit beyond what regex was designed for. As you've discovered, there are many ways to represent a phone number, with and without things like area codes and formatting. Regex is for pattern matching, so using it to strip out all formatting and then comparing strings is doable, but putting the comparison logic into the regex itself is not what it's for.
I would suggest the first and biggest thing to do is sort out the representation of phone numbers. Since you have database access you might want to look at creating a new field or table to represent a phone number object. Then put your comparison logic in the model.
Yes, it's more work, but it keeps the code more understandable going forward and helps clean up bad data.
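A minimal sketch of what such normalization and comparison logic might look like, kept separate from the query itself. The helper names and the seven-digit minimum are assumptions carried over from the question; note that this version compares on the trailing digits, which is stricter than the Contains check in the question.

```python
import re


def digits_only(value: str) -> str:
    """Strip every non-digit character from a stored phone number."""
    return re.sub(r"\D", "", value)


def same_number(canonical: str, stored: str, min_digits: int = 7) -> bool:
    """True if the stored value, reduced to digits, is a long enough
    suffix of the canonical number (country code + full number)."""
    candidate = digits_only(stored)
    return len(candidate) >= min_digits and digits_only(canonical).endswith(candidate)


for stored in ["(555) 555-5555", "5555555555", "555-555-5555", "555-5555"]:
    print(stored, same_number("15555555555", stored))
```

Storing the digits-only form in its own column would let the database perform this comparison directly, instead of loading every row and cleaning it in application code.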

How Lucene scores results in a RegexQuery?

I can see how two values, when doing a regular/fuzzy full text search, can be compared to determine which one is "better" (i.e. one value contains more keywords than the other, one contains fewer non-keywords than the other).
However, how does Lucene compute the score when doing regex queries using RegexQuery? It is a boolean query: a field's value is either compatible with the regex or not. Lucene can't take keywords from my regex query and do its usual magic...
There are two passes. In the first, it generates a list of all terms which match the regex. In the second, it finds all documents with terms matching that regex.
The major code you want to look at is in MultiTermQuery:
public Query rewrite(IndexReader reader) throws IOException {
FilteredTermEnum enumerator = getEnum(reader);
BooleanQuery query = new BooleanQuery();
try {
do {
Term t = enumerator.term();
if (t != null) {
TermQuery tq = new TermQuery(t); // found a match
tq.setBoost(getBoost() * enumerator.difference()); // set the boost
query.add(tq, false, false); // add to query
}
} while (enumerator.next());
} finally {
enumerator.close();
}
return query;
}
Two things:
The boolean query is instantiated with coord on. So the standard coord scoring applies (i.e. the more terms you get, the better).
The boost of the term query is given by enumerator.difference(). However, as of 3.0.1 this just returns 1:
@Override
public final float difference() {
// TODO: adjust difference based on distance of searchTerm.text() and term().text()
return 1.0f;
}
So at some point this will return the distance (probably Levenshtein) between the terms. But for now it does nothing.
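The two passes can be modelled in a few lines to see why documents matching more expanded terms come out ahead. This is a toy model only: it assumes the regex must match a whole term, and it ignores tf, idf and length norms entirely, keeping just the per-term boost (always 1.0, as above) and a coord factor of matched clauses over total clauses.

```python
import re


def rewrite(pattern: str, index_terms, boost: float = 1.0):
    """Pass 1: expand the regex into (term, boost) clauses over the index terms."""
    regex = re.compile(pattern)
    difference = 1.0  # stands in for enumerator.difference()
    return [(t, boost * difference) for t in index_terms if regex.fullmatch(t)]


def score(doc_terms, clauses) -> float:
    """Pass 2: sum the boosts of matching clauses, scaled by coord."""
    matched = [b for t, b in clauses if t in doc_terms]
    if not clauses:
        return 0.0
    coord = len(matched) / len(clauses)
    return sum(matched) * coord


clauses = rewrite(r"wol.*", ["cat", "wolf", "wolves", "woof"])
print(clauses)                              # [('wolf', 1.0), ('wolves', 1.0)]
print(score({"wolf", "wolves"}, clauses))   # 2.0
print(score({"wolf", "cat"}, clauses))      # 0.5
```

In this model a document containing both "wolf" and "wolves" outranks one containing only "wolf", which is the coord effect described in the first point above.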
This is just a wild guess, but one possible metric could be the number of backtracking steps the regex engine needs to take to match your search strings.
Of course, these values also depend mightily on the quality of your regex, but when comparing several matches, the one that was "easier to match" could be considered a better match than the one that the regex engine had to go through contortions for.