How Lucene scores results in a RegexQuery?

How Lucene scores results in a RegexQuery? - regex

I can see how two values, when doing a regular/fuzzy full text search, can be compared to determine which one is "better" (i.e. one value contains more keywords than the other, one contains less non-keywords than the other).
However, how Lucene computes the score when doing regex queries using RegexQuery? It is a boolean query - a field's value is either compatible with the regex or not. Lucene can't take keywords from my regex query and do its usual magic...

There are two passes. In the first, it generates a list of all terms which match the regex. In the second, it finds all documents with terms matching that regex.
The major code you want to look at is in MultiTermQuery:
public Query rewrite(IndexReader reader) throws IOException {
FilteredTermEnum enumerator = getEnum(reader);
BooleanQuery query = new BooleanQuery();
try {
do {
Term t = enumerator.term();
if (t != null) {
TermQuery tq = new TermQuery(t); // found a match
tq.setBoost(getBoost() * enumerator.difference()); // set the boost
query.add(tq, false, false); // add to query
}
} while (enumerator.next());
} finally {
enumerator.close();
}
return query;
}
Two things:
The boolean query is instantiated with coord on. So the standard coord scoring applies (i.e. the more terms you get, the better).
The boost of the term query is given by enumerator.difference(). However, as of 3.0.1 this just returns 1:
#Override
public final float difference() {
// TODO: adjust difference based on distance of searchTerm.text() and term().text()
return 1.0f;
}
So at some point this will return the distance (probably levenstein) between the terms. But for now it does nothing.

This is just a wild guess, but one possible metric could be the number of backtracking steps the regex engine needs to take to match your search strings.
Of course, these values also depend mightily on the quality of your regex, but when comparing several matches, the one that was "easier to match" could be considered a better match than the one that the regex engine had to go through contortions for.

Related

How to get the most accurate term in regex?

I have an angular app using the mongodb sdk for js.
I would like to suggest some words on a input field for the user from my words collection, so I did:
getSuggestions(term: string) {
var regex = new stitch.BSON.BSONRegExp('^' +term , 'i');
return from(this.words.find({ 'Noun': { $regex: regex } }).execute());
}
The problem is that if the user type for example Bie, the query returns a lot of documents but the most accurated are the last ones, for example Bier, first it returns the bigger words, like Bieberbach'sche Vermutung. How can I deal to return the closests documents first?

A regular-expression is probably not enough to do what you are intending to do here. They can only do what they're meant to do – match a string. They might be used to give you a candidate entry to present to the user, but can't judge or weigh them. You're going to have to devise that logic yourself.

Jmeter Regular Expression Extractor. How to save all returned values to a single variable?

I'm quite new to Jmeter and already spent numerous hours to figure it out.
What i'm trying to achieve:
Using Post Processor Regex Extractor I wrote a regex that returns me several values (already tested it in www.regex101.com and it's working as expected). However, when I do this in Jmeter, I need to provide MatchNo. which in this case will only return to me one certain value. I sort of figured it out that negative digit in this field (Match No) suppose to return all values found. When I use Debug Sampler to find out how many values are returned and to what variables they are assigned, I see a lot of unfamiliar stuff. Please see examples below:
Text where regex to be parsed:
some data here...
"PlanDescription":"DF4-LIB 4224-NNJ"
"PlanDescription":"45U-LIP 2423-NNJ"
"PlanDescription":"PMH-LIB 131-NNJ"
some data here...
As I said earlier, at www.regex101.com I tested this with regex:
\"PlanDescription\":\"([^\"]*)\"
And all needed for me information are correct (with the group 1).
DF4-LIB 4224-NNJ
45U-LIP 2423-NNJ
PMH-LIB 131-NNJ
With the negative number (I tried -1, -2, -3 - same result) at MatchNo. field in Jmeter Regex Extractor field (which Reference Name is Plans) at the Debug Sampler I see the following:
Plans=
Plans_1=DF4-LIB 4224-NNJ
Plans_1_g=1
Plans_1_g0="PlanDescription":"DF4-LIB 4224-NNJ"
Plans_1_g1=DF4-LIB 4224-NNJ
Plans_2=45U-LIP 2423-NNJ
Plans_2_g=1
Plans_2_g0="PlanDescription":"45U-LIP 2423-NNJ"
Plans_2_g1=45U-LIP 2423-NNJ
Plans_3=PMH-LIB 131-NNJ
Plans_3_g=1
Plans_3_g0="PlanDescription":"PMH-LIB 131-NNJ"
Plans_3_g1=PMH-LIB 131-NNJ
I only need at this particular case - Jmeter regex to return 3 values that contain:
DF4-LIB 4224-NNJ
45U-LIP 2423-NNJ
PMH-LIB 131-NNJ
And nothing else. If anybody faced that problem before any help will be appreciated.

Based on output of the Debug Sampler, there's no problem, it's just how RegEx returns the response:
Plans_1,Plans_2,Plans_3 is the actual set of variables you wanted.
There should also be Plans_matchNr which should contain the number of matches (3 in your example). It's important if you loop through them (you will loop from 1 to the value of this variable)
_g sets of variables refer to matching groups per matching instance (3 in your case). Ignore them if you don't care about them. They are always publish, but there's no harm in that.
Once variables are published you can do a number of things:
Use them as ${Plans_1}, ${Plans_2}, ${Plans_3} etc. (as comment above noticed).
Use Plans_... variables in loop: refer to the next variable in the loop as ${__V(Plans_${i})}, where i is a counter with values between 1 and Plans_matchNr
You can also concatenate them into 1 variable using the following simple BeanShell Post-Processor or BeanShell Sampler script:
int count = 0;
String allPlans = "";
// Get number of variables
try {
count = Integer.parseInt(vars.get("Plans_matchNr"));
} catch(NumberFormatException e) {}
// Concatenate them (using space). This could be optimized using StringBuffer of course
for(int i = 1; i <= count; i++) {
allPlans += vars.get("Plans_" + i) + " ";
}
// Save concatenated string into new variable
vars.put("AllPlans", allPlans);
As a result you will have all old variables, plus:
AllPlans=DF4-LIB 4224-NNJ 45U-LIP 2423-NNJ PMH-LIB 131-NNJ

Regex with SQL Server 2008 CLR performance issues

I am trying to understand why is it taking so long to execute a simple query.
In my local machine it takes 10 seconds but in production it takes 1 min.
(I imported the database from production into my local database)
select *
from JobHistory
where dbo.LikeInList(InstanceID, 'E218553D-AAD1-47A8-931C-87B52E98A494') = 1
The table DataHistory is not indexed and it has 217,302 rows
public partial class UserDefinedFunctions
{
[SqlFunction]
public static bool LikeInList([SqlFacet(MaxSize = -1)]SqlString value, [SqlFacet(MaxSize = -1)]SqlString list)
{
foreach (string val in list.Value.Split(new char[] { ',' }, StringSplitOptions.None))
{
Regex re = new Regex("^.*" + val.Trim() + ".*$", RegexOptions.IgnoreCase);
if (re.IsMatch(value.Value))
{
return(true);
}
}
return (false);
}
};
And the issue is that if a table has 217k rows then I will be calling that function 217,000 times! not sure how I can rewrite this thing.
Thank you

There are several issues with this code:
Missing (IsDeterministic = true, IsPrecise = true) in [SqlFunction] attribute. Doing this (mainly just the IsDeterministic = true part) will allow the SQLCLR UDF to participate in parallel execution plans. Without setting IsDeterministic = true, this function will prevent parallel plans, just like T-SQL UDFs do.
Return type is bool instead of SqlBoolean
RegEx call is inefficient: using an instance method once is expensive. Switch to using the static Regex.IsMatch instead
RegEx pattern is very inefficient: wrapping the search string in "^.*" and ".*$" will require the RegEx engine to parse and retain in memory as the "match", the entire contents of the value input parameter, for every single iteration of the foreach. Yet the behavior of Regular Expressions is such that simply using val.Trim() as the entire pattern would yield the exact same result.
(optional) If neither input parameter will ever be over 4000 characters, then specify a MaxSize of 4000 instead of -1 since NVARCHAR(4000) is much faster than NVARCHAR(MAX) for passing data into, and out of, SQLCLR objects.

lucene, search for sentences that contain one term twice

I have pairs of search strings, and I want to use Lucene to search for sentences that contain all terms that are contained in these strings. So for example if I have the two search strings "white shark" and "fish", I want all sentences containing both "white", "shark" and "fish". Apparently, with Lucene this can be done rather easily by means of a boolean query; this is how I do it in my code:
String search = str1+" "+ str2;
BooleanQuery booleanQuery = new BooleanQuery();
QueryParser queryParser = new QueryParser(...);
queryParser.setDefaultOperator(QueryParser.Operator.AND);
booleanQuery.add(queryParser.parse(search), BooleanClause.Occur.MUST);
However, I also have pairs of search strings where one string is a subpart of the other, such as e.g. "timber wolf" and "wolf", and in these cases I would like to only get sentences that contain "wolf" at least twice (and "timber" at least once). Is there any way to achieve this with Lucene? Many thanks in advance for your answers!

Keep in mind, that a doc that has both "timber wolf" and a separate "wolf" will get scored higher all else being equal, since the term "wolf" occurs twice giving it a higher tf score. In most cases like this, the desired results being the first ones is acceptable, usually even desirable.
That said, I believe you could get what you want using a phrase query with slop, and having the slop set adequately high. Something like:
"timber wolf wolf"~10000
Which is probably high enough for most cases. This would require two instances of wolf and one of timber.
However, if you need timber wolf to appear (that is, those two terms adjacent and in order), you'll need to abandon the query parser, and construct the appropriate queries yourself. SpanQueries, specifically.
SpanQuery wolfQuery = new SpanTermQuery(new Term("myField", "wolf"));
SpanQuery[] timberWolfSubQueries = {
new SpanTermQuery(new Term("myField", "timber")),
new SpanTermQuery(new Term("myField", "wolf"))
};
//arguments "0, true" mean 0 slop and in order (respectively)
SpanQuery timberWolfQuery = new SpanNearQuery(timberWolfSubQueries, 0, true);
SpanQuery[] finalSubQueries = {
wolfQuery, timberWolfQuery
};
//arguments "10000, false" mean 10000 slop and not (necessarily) in order
SpanQuery finalQuery = new SpanNearQuery(finalSubQueries, 10000, false);

Validating a Salesforce Id

Is there a way to validate a Salesforce ID, maybe using RegEx? They are normally 15 chars or 18 chars but do they follow a pattern that we can use to check that it's a valid id.

There are two levels of validating salesforce id:
check format using regular expression [a-zA-Z0-9]{15}|[a-zA-Z0-9]{18}
for 18-characted ids you can check the the 3-character checksum:
Code examples provided in comments:
C#
Go
Javascript
Ruby

Something like this should work:
[a-zA-Z0-9]{15,18}
It was suggested that this may be more correct because it prevents Ids with lengths of 16 and 17 characters to be rejected, also we try to match against 18 char length first with 15 length as a fallback:
[a-zA-Z0-9]{18}|[a-zA-Z0-9]{15}

Just use instanceOf to check if the string is an instance of Id.
String s = '1234';
if (s instanceOf Id) System.debug('valid id');
else System.debug('invalid id');

The easiest way I've come across, is to create a new ID variable and assign a String to it.
ID MyTestID = null;
try {
MyTestID = MyTestString; }
catch(Exception ex) { }
If MyTestID is null after trying to assign it, the ID was invalid.

This regex has given me the optimal results so far.
\b[a-z0-9]\w{4}0\w{12}|[a-z0-9]\w{4}0\w{9}\b

You can also check for 15 chars, and then add an extra 3 chars optional, with an expression similar to:
^[a-z0-9]{15}(?:[a-z0-9]{3})?$
on i mode, or not:
^[A-Za-z0-9]{15}(?:[A-Za-z0-9]{3})?$
Demo
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
RegEx Circuit
jex.im visualizes regular expressions:

Javascript: /^(?=.*?\d)(?=.*?[a-z])[a-z\d]{18}$/i
These were the Salesforce Id validation requirements for me.
18 characters only
At least one digit
At least one alphabet
Case insensitive
Test cases
Should fail
1
a
1234
abgcde
1234aDcde
12345678901234567*
123456789012345678
abcDefghijabcdefgh
Should pass
1234567890abcDeFgh
1234abcd1234abcd12
abcd1234abcd1234ab
1abcDefhijabcdefgf
abcDefghijabcdefg1
12345678901234567a
a12345678901234567
For understanding the regex, please refer this thread

The regex provided by Daniel Sokolowski works perfectly to verify if the id is in the correct format.
If you want to verify if an id corresponds to an actual record in the database, you'll need to first find the object type from the first three characters (commonly known as prefix) and then query the object type:
boolean isValidAndExists(String key) {
Map<String, Schema.SObjectType> objTypes = Schema.getGlobalDescribe();
for (Schema.SObjectType objType : objTypes.values()) {
Schema.DescribeSObjectResult objDesc = objType.getDescribe();
if (objDesc.getKeyPrefix() == key.substring(0,3)) {
String objName = objDesc.getName();
String query = 'SELECT Id FROM ' + objName + ' WHERE Id = \'' + key + '\'';
SObject[] objs = Database.query(query);
return !objs.isEmpty();
}
}
return false;
}
Be aware that Schema.getGlobalDescribe can be an expensive operation and degrade the performance of your application if you use that often.
If you need to check that often, I recommend creating a Custom Setting or Custom Metadata to store the relation between prefixes and object types.

Assuming you want to validate Ids in Apex, there are a few approaches discussed in the other answers. Here is an alternative, with notes on the various approaches.
The try-catch method (credit to #matt_k) certainly works, but some folks worry about overhead, especially if testing many Ids.
I used instanceof Id for a long time (credit to #melani_s), until I discovered that it sometimes gives the wrong answer (e.g., '481D0B74-41CF-47E9').
Multiple answers suggest regexen. As the accepted answer correctly points out (credit to #zacheusz), 18 character Ids are only valid if their checksums are correct, which means the regex solutions can be wrong. That answer also helpfully provides code in several languages to test Id checksums. But not in Apex.
I was going to implement the checksum code in Apex, but then I realized the Salesforce had already done the work, so instead I just convert 18 digit Ids to 15 digit Ids (via .to15() which uses the checksum to fix capitalization, as opposed to truncating the string) and then back to 18 digits to let SF do the checksum calc, then I compare the original checksum and the new one. This is my method:
static Pattern ID_REGEX = Pattern.compile('[a-zA-Z0-9]{15}(?:[A-Z0-5]{3})?');
/**
* #description Determines if a string is a valid SalesforceId. Confirms checksum of 18 digit Ids.
* Works for cases where `x instanceof id` returns the wrong answer, like '481D0B74-41CF-47E9'.
* Does NOT check for the existence of a record with the given Id.
* #param s a string to validate
*
* #return true if the string `s` is a valid Salesforce Id.
*/
public static Boolean isValidId(String s) {
Matcher m = ID_REGEX.matcher(s);
if (m.matches() == false) return false; // if it doesn't match the regex it cannot be valid
if (s.length() == 15) return true; // if 15 char string matches the regex, assume it must be valid
String check = (Id)((Id)s).to15(); // Convert to 15 char Id, then to Id and back to string, giving correct 18-char Id
return s.right(3) == check.right(3); // if 18 char string matches the regex, valid if checksum correct
}

Additionally checking getSObjectType() != null would be perfect if we are dealing with Salesforce records
public static boolean isRecordId(string recordId){
try{
return string.isNotBlank(recordId) && ((Id)recordId.trim()).getSObjectType() != null;
}catch(Exception ex){
return false;
}
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How Lucene scores results in a RegexQuery? - regex

Related

How to get the most accurate term in regex?

Jmeter Regular Expression Extractor. How to save all returned values to a single variable?

Regex with SQL Server 2008 CLR performance issues

lucene, search for sentences that contain one term twice

Validating a Salesforce Id

Categories

Resources