Regex with SQL Server 2008 CLR performance issues

I am trying to understand why it is taking so long to execute a simple query.
On my local machine it takes 10 seconds, but in production it takes 1 minute.
(I imported the database from production into my local database.)
select *
from JobHistory
where dbo.LikeInList(InstanceID, 'E218553D-AAD1-47A8-931C-87B52E98A494') = 1
The table JobHistory is not indexed and it has 217,302 rows.
public partial class UserDefinedFunctions
{
    [SqlFunction]
    public static bool LikeInList([SqlFacet(MaxSize = -1)] SqlString value, [SqlFacet(MaxSize = -1)] SqlString list)
    {
        foreach (string val in list.Value.Split(new char[] { ',' }, StringSplitOptions.None))
        {
            Regex re = new Regex("^.*" + val.Trim() + ".*$", RegexOptions.IgnoreCase);
            if (re.IsMatch(value.Value))
            {
                return true;
            }
        }
        return false;
    }
};
And the issue is that if the table has 217k rows, I will be calling that function 217,000 times! I am not sure how I can rewrite this.
Thank you

There are several issues with this code:
Missing (IsDeterministic = true, IsPrecise = true) in [SqlFunction] attribute. Doing this (mainly just the IsDeterministic = true part) will allow the SQLCLR UDF to participate in parallel execution plans. Without setting IsDeterministic = true, this function will prevent parallel plans, just like T-SQL UDFs do.
Return type is bool instead of SqlBoolean
RegEx call is inefficient: creating a new Regex instance on every iteration is expensive. Switch to the static Regex.IsMatch method instead.
RegEx pattern is very inefficient: wrapping the search string in "^.*" and ".*$" forces the RegEx engine to parse, and retain in memory as the "match", the entire contents of the value input parameter on every single iteration of the foreach. Yet simply using val.Trim() as the entire pattern would yield the exact same result.
(optional) If neither input parameter will ever be over 4000 characters, then specify a MaxSize of 4000 instead of -1 since NVARCHAR(4000) is much faster than NVARCHAR(MAX) for passing data into, and out of, SQLCLR objects.
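Putting these points together, a corrected version might look like the sketch below. This is my assembly of the fixes above, not a tested drop-in: it assumes neither input ever exceeds 4000 characters, and it still treats each list entry as a regex pattern, exactly as the original did.

using System;
using System.Data.SqlTypes;
using System.Text.RegularExpressions;
using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions
{
    [SqlFunction(IsDeterministic = true, IsPrecise = true)]
    public static SqlBoolean LikeInList(
        [SqlFacet(MaxSize = 4000)] SqlString value,
        [SqlFacet(MaxSize = 4000)] SqlString list)
    {
        foreach (string val in list.Value.Split(new char[] { ',' }, StringSplitOptions.None))
        {
            // Static Regex.IsMatch avoids constructing a new Regex instance on
            // every iteration, and val.Trim() alone already matches anywhere in
            // the input, so the "^.*" / ".*$" wrapping is dropped.
            if (Regex.IsMatch(value.Value, val.Trim(), RegexOptions.IgnoreCase))
            {
                return SqlBoolean.True;
            }
        }
        return SqlBoolean.False;
    }
}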

Related

Spark Scala: SQL rlike vs Custom UDF

I have a scenario where 10K+ regular expressions are stored in a table along with various other columns, and these need to be joined against an incoming dataset. Initially I was using the Spark SQL rlike method as below, and it was able to hold the load until incoming record counts exceeded 50K.
PS: The regular expression reference data is a broadcast dataset.
dataset.join(regexDataset.value, expr("input_column rlike regular_exp_column"))
Then I wrote a custom UDF to transform them using Scala's native regex search, as below.
The val below collects the reference data as an Array of tuples:
val regexPreCalcArray: Array[(Int, Regex)] = {
  regexDataset.value
    .select("col_1", "regex_column")
    .collect
    .map(row => (row.get(0).asInstanceOf[Int], row.get(1).toString.r))
}
Implementation of the regex-matching UDF:
def findMatchingPatterns(regexDSArray: Array[(Int, Regex)]): UserDefinedFunction = {
  udf((input_column: String) => {
    for {
      text <- Option(input_column)
      matches = regexDSArray.filter(regexDSValue => regexDSValue._2.findFirstIn(text).nonEmpty)
      if matches.nonEmpty
    } yield matches.map(x => x._1).min
  }, IntegerType)
}
Joins are done as below: in case of multiple regex matches, the UDF returns the minimum unique ID from the reference data, which is then joined back against the reference data on that unique ID to retrieve the other columns needed for the result.
dataset.withColumn("min_unique_id", findMatchingPatterns(regexPreCalcArray)($"input_column"))
  .join(regexDataset.value, $"min_unique_id" === $"unique_id", "left")
But this too gets very slow, with skew in execution (one executor task runs for a very long time), when the record count increases above 1M. Spark suggests not to use UDFs as they degrade performance. Are there any other best practices I should apply here? Is there a better API for Scala regex matching than what I've written? Any suggestions to do this efficiently would be very helpful.

Calculate range of values entered by user with a custom function in Google Apps Script

I want to use ARRAYFORMULA with my custom function if possible, because I want to input a range of values.
I also get this error: TypeError: Cannot read property "0" from null.
Also this one: Service invoked too many times in a short time: exec qps. Try Utilities.sleep(1000) between calls
var regExp = new RegExp("Item: ([^:]+)(?=\n)");
var matches = new regExp(input);
return matches[0];
}
I would really appreciate some help.
Edit:
Based on the second picture, I also tried using a regex formula to find words starting with "Billing address".
For the first picture, I used a regex formula to find words starting with "Item".
The error is the same for both custom functions.
If you want to use a custom function which finds all the strings that start with Item or item and extracts the contents after the match, you can use the code provided below. The regular expression is checked using the match() function, and the function returns the desired result; otherwise it returns null.
function ITEM(input) {
  var regEx = /(?:I|i)tem\s*(.*)$/;
  var matches = input.match(regEx);
  if (matches && matches.length > 1) {
    return matches[1];
  } else {
    return null;
  }
}
If you want to use RegExp the way you did in the code you shared, you should use \\ instead of \ inside the pattern string.
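For example, the pattern from your question would be written like this in constructor form (just a sketch of the escaping point, with a null check added for the TypeError):

// In constructor form the pattern is an ordinary string literal, so the
// backslash must be doubled for it to reach the RegExp engine as \n.
var regExp = new RegExp("Item: ([^:]+)(?=\\n)");
var matches = input.match(regExp);   // match() returns null when nothing matches
// Reading [0] from that null is what produced "Cannot read property '0' from null"
return matches ? matches[1] : null;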
For checking and verifying regular expressions you can use a site such as regex101.com.
The Service invoked too many times in a short time: exec qps. Try Utilities.sleep(1000) between calls. error message you are getting is due to calling the custom function on too many cells, for example by dragging the custom function across too many cells at once.

CouchDB View - filter keys before grouping

I have a CouchDB database which has documents with the following format:
{ createdBy: 'userId', at: 123456, type: 'action_type' }
I want to write a view that will give me how many actions of each type were created by which user. I was able to do that by creating a view that does this:
emit([doc.createdBy, doc.type, doc.at], 1);
With the reduce function "sum" and consuming the view in this way:
/_design/userActionsDoc/_view/userActions?group_level=2
This returns a result with rows just the way I want:
"rows":[ {"key":["userId","ACTION_1"],"value":20}, ...
The problem is that now I want to filter the results for a given time period, i.e. to have the exact same information but only considering actions which happened within that period.
I can filter the documents by "at" if I emit the fields in a different order:
emit([doc.at, doc.type, doc.createdBy], 1);
?group_level=3&startkey=[149328316160]&endkey=[1493283161647,{},{}]
but then I won't get the results grouped by userId and actionType. Is there a way to have both? Maybe writing my own reduce function?
I feel your pain. I have done two different things in the past to attempt to solve similar issues.
The first pattern is a pain and may work great or may not work at all. I've experienced both. Your map function looks something like this:
function(doc) {
  var obj = {};
  obj[doc.createdBy] = {};
  obj[doc.createdBy][doc.type] = 1;
  emit(doc.at, obj);
  // Ignore this for now
  // emit(doc.at, JSON.stringify(obj));
}
Then your reduce function looks like this:
function(key, values, rereduce) {
  var output = {};
  values.forEach(function(v) {
    // Ignore this for now
    // v = JSON.parse(v);
    for (var user in v) {
      // Initialize the per-user object before accumulating into it
      output[user] = output[user] || {};
      for (var action in v[user]) {
        output[user][action] = (output[user][action] || 0) + v[user][action];
      }
    }
  });
  return output;
  // Ignore this for now
  // return JSON.stringify(output);
}
With large datasets, this usually results in a couch error stating that your reduce function is not shrinking fast enough. In that case, you may be able to stringify/parse the objects as shown in the "ignore" comments in the code.
The reasoning behind this is that couchdb ultimately wants you to output a simple object like a string or integer in a reduce function. In my experience, it doesn't seem to matter that the string gets longer, as long as it remains a string. If you output an object, at some point the function errors because you have added too many props to that object.
The second pattern is potentially better, but requires that your time periods be "defined" ahead of time. If your time-period requirements can be locked down to a specific year, month, day, quarter, etc., you just emit multiple times in your map function. Below I assume the at property is epoch milliseconds, or at least something the Date constructor can accurately parse.
function(doc) {
  var time_key;
  var my_date = new Date(doc.at);
  //// Used for filtering results in a given year
  //// e.g. startkey=["2017"]&endkey=["2017",{}]
  time_key = my_date.toISOString().substr(0, 4);
  emit([time_key, doc.createdBy, doc.type], 1);
  //// Used for filtering results in a given month
  //// e.g. startkey=["2017-01"]&endkey=["2017-01",{}]
  time_key = my_date.toISOString().substr(0, 7);
  emit([time_key, doc.createdBy, doc.type], 1);
  //// Used for filtering results in a given quarter
  //// e.g. startkey=["2017Q1"]&endkey=["2017Q1",{}]
  //// getMonth() is 0-based, so add 1 to number the quarters 1-4
  time_key = my_date.toISOString().substr(0, 4) + 'Q' + (Math.floor(my_date.getMonth() / 3) + 1).toString();
  emit([time_key, doc.createdBy, doc.type], 1);
}
Then, your reduce function is the same as in your original. Essentially you're just defining a constant value for the first item in your key that corresponds to a defined time period. This works well for business reporting, but not so much for flexible time periods.
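For example, assuming the map function above were saved as a view named userActionsByPeriod in the same design document (the view name here is hypothetical), the quarterly rollup would be queried just like the original view:

/_design/userActionsDoc/_view/userActionsByPeriod?group_level=3&startkey=["2017Q1"]&endkey=["2017Q1",{}]

Each resulting row key then looks like ["2017Q1", "userId", "ACTION_1"], i.e. grouped per user and action type within the quarter.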

JMeter Regular Expression Extractor: how to save all returned values to a single variable?

I'm quite new to JMeter and have already spent numerous hours trying to figure this out.
What I'm trying to achieve:
Using the Regular Expression Extractor post-processor, I wrote a regex that returns several values (I already tested it at www.regex101.com and it works as expected). However, in JMeter I need to provide a Match No., which in this case returns only one specific value. I have sort of figured out that a negative number in this field is supposed to return all values found. When I use the Debug Sampler to find out how many values are returned and which variables they are assigned to, I see a lot of unfamiliar stuff. Please see the examples below:
Text that the regex should parse:
some data here...
"PlanDescription":"DF4-LIB 4224-NNJ"
"PlanDescription":"45U-LIP 2423-NNJ"
"PlanDescription":"PMH-LIB 131-NNJ"
some data here...
As I said earlier, I tested this at www.regex101.com with the regex:
\"PlanDescription\":\"([^\"]*)\"
and all the information I need is captured correctly (in group 1):
DF4-LIB 4224-NNJ
45U-LIP 2423-NNJ
PMH-LIB 131-NNJ
With a negative number in the Match No. field of the JMeter Regex Extractor (I tried -1, -2, -3; same result) and a Reference Name of Plans, the Debug Sampler shows the following:
Plans=
Plans_1=DF4-LIB 4224-NNJ
Plans_1_g=1
Plans_1_g0="PlanDescription":"DF4-LIB 4224-NNJ"
Plans_1_g1=DF4-LIB 4224-NNJ
Plans_2=45U-LIP 2423-NNJ
Plans_2_g=1
Plans_2_g0="PlanDescription":"45U-LIP 2423-NNJ"
Plans_2_g1=45U-LIP 2423-NNJ
Plans_3=PMH-LIB 131-NNJ
Plans_3_g=1
Plans_3_g0="PlanDescription":"PMH-LIB 131-NNJ"
Plans_3_g1=PMH-LIB 131-NNJ
In this particular case, I only need the JMeter regex to return the 3 values:
DF4-LIB 4224-NNJ
45U-LIP 2423-NNJ
PMH-LIB 131-NNJ
And nothing else. If anybody has faced this problem before, any help would be appreciated.
Based on the output of the Debug Sampler, there's no problem; this is just how the Regular Expression Extractor publishes its results:
Plans_1, Plans_2, Plans_3 are the actual variables you wanted.
There should also be Plans_matchNr, which contains the number of matches (3 in your example). It's important if you loop through them (you loop from 1 to the value of this variable).
The _g sets of variables refer to the matching groups of each match (3 matches in your case). Ignore them if you don't need them; they are always published, but there's no harm in that.
Once variables are published you can do a number of things:
Use them as ${Plans_1}, ${Plans_2}, ${Plans_3}, etc.
Use the Plans_... variables in a loop: refer to the next variable in the loop as ${__V(Plans_${i})}, where i is a counter with values between 1 and Plans_matchNr.
You can also concatenate them into 1 variable using the following simple BeanShell Post-Processor or BeanShell Sampler script:
int count = 0;
String allPlans = "";

// Get number of variables
try {
    count = Integer.parseInt(vars.get("Plans_matchNr"));
} catch (NumberFormatException e) {}

// Concatenate them (using a space separator). This could be optimized using StringBuffer, of course
for (int i = 1; i <= count; i++) {
    allPlans += vars.get("Plans_" + i) + " ";
}

// Save the concatenated string into a new variable (trim() drops the trailing space)
vars.put("AllPlans", allPlans.trim());
As a result you will have all old variables, plus:
AllPlans=DF4-LIB 4224-NNJ 45U-LIP 2423-NNJ PMH-LIB 131-NNJ

How does Lucene score results in a RegexQuery?

I can see how two values, when doing a regular/fuzzy full-text search, can be compared to determine which one is "better" (i.e. one value contains more keywords than the other, or contains fewer non-keywords).
However, how does Lucene compute the score when doing regex queries using RegexQuery? It is a boolean query: a field's value either matches the regex or it doesn't. Lucene can't take keywords from my regex query and do its usual magic...
There are two passes. In the first, it generates a list of all terms which match the regex. In the second, it finds all documents with terms matching that regex.
The major code you want to look at is in MultiTermQuery:
public Query rewrite(IndexReader reader) throws IOException {
    FilteredTermEnum enumerator = getEnum(reader);
    BooleanQuery query = new BooleanQuery();
    try {
        do {
            Term t = enumerator.term();
            if (t != null) {
                TermQuery tq = new TermQuery(t);                   // found a match
                tq.setBoost(getBoost() * enumerator.difference()); // set the boost
                query.add(tq, false, false);                       // add to query
            }
        } while (enumerator.next());
    } finally {
        enumerator.close();
    }
    return query;
}
Two things:
The boolean query is instantiated with coord on, so the standard coord scoring applies (i.e. the more matching terms a document has, the better).
The boost of the term query is given by enumerator.difference(). However, as of 3.0.1 this just returns 1:
@Override
public final float difference() {
    // TODO: adjust difference based on distance of searchTerm.text() and term().text()
    return 1.0f;
}
So at some point this will presumably return the distance (probably Levenshtein) between the terms. But for now it does nothing.
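Purely as a hypothetical sketch of what that TODO might eventually become (nothing like this ships in Lucene 3.0.1; the levenshtein() helper and the length normalization are my assumptions), a distance-based difference() could look like:

// Hypothetical sketch only. Returns a similarity in [0, 1],
// where 1.0 means the two terms are identical.
@Override
public final float difference() {
    String a = searchTerm.text();
    String b = term().text();
    int maxLen = Math.max(a.length(), b.length());
    if (maxLen == 0) return 1.0f;
    return 1.0f - ((float) levenshtein(a, b) / maxLen);
}

// Standard dynamic-programming edit distance (insert/delete/substitute, cost 1 each),
// keeping only two rows of the DP table at a time.
private static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) prev[j] = j;
    for (int i = 1; i <= a.length(); i++) {
        curr[0] = i;
        for (int j = 1; j <= b.length(); j++) {
            int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
            curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
        }
        int[] tmp = prev; prev = curr; curr = tmp;
    }
    return prev[b.length()];
}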
This is just a wild guess, but one possible metric could be the number of backtracking steps the regex engine needs to take to match your search strings.
Of course, these values also depend mightily on the quality of your regex, but when comparing several matches, the one that was "easier to match" could be considered a better match than the one that the regex engine had to go through contortions for.