Can use of $regex in MongoDB Aggregation Framework be optimized? - regex

I'm currently using the Aggregation Framework to calculate some aggregate values on a collection. The typical query/operation might include close to 1 million documents that are indexed (the f1/ts match reduces the number of documents down to the 1 million from about 5 million, the set can be made smaller depending on the query parameters, e.g. timeframe selected).
The $match uses indexed attributes of the documents, but I need to include what amounts to a full-text search. The only option that I've used is a $regex match. Unfortunately, I can't anchor my $regex match to the start of the string since the string(s) I'm looking could be anywhere. The text I'm "searching" could be anywhere from a few characters in length to a few thousand.
Just running some basic comparisons, the inclusion of the $regex match attribute nearly doubles the time the calculation takes to complete.
What are my options for optimizing this operation?
Is it possible to achieve this some other way?
Will text-search be available for use in aggregation operations in v2.6?
For reference:
Operation without $regex
db.my_collection.aggregate(
{
$match:{
f1:ObjectId('417abd81...577000006'),
ts:{$gte:t1,$lte:t2}
}
},{
$project:{
ts:1,
p:1
}
},{
$group:{
_id:"$ts",
x: {
$sum:"$p"
}
}
});
Operation with $regex
db.my_collection.aggregate(
{
$match:{
f1:ObjectId('417abd81...577000006'),
ts:{$gte:t1,$lte:t2}
}
},
{
$match:{
c:{$regex: /somevalue/i}
}
},{
$project:{
ts:1,
p:1
}
},{
$group:{
_id:"$ts",
x: {
$sum:"$p"
}
}
});

Related

Perl String Regular Expression

i need some help in Perl regular expressions.
I have this string:
{
"ITEM":[
{
"-itemID": "1000000" ,
"-itemName": "DisneyJuniorLA" ,
"-thumbUrl": "" ,
"-packageID": "1" ,
"-itemPrice": "0" ,
"-isLock": "true"
},
{
"-itemID": "1000001" ,
"-itemName": "31 minutos" ,
"-thumbUrl": "" ,
"-packageID": "1" ,
"-itemPrice": "0" ,
"-isLock": "true"
},
{
"-itemID": "1000002" ,
"-itemName": "Plaza Sésamo" ,
"-thumbUrl": "" ,
"-packageID": "1" ,
"-itemPrice": "0" ,
"-isLock": "true"
},
]
}
The string is in a variable: $jsonString
I have another variable: $itemName
I want to only keep in $jsonString the itemId value that is above itemName (where itemName equals $itemName)
I would really appreciate your help. I am really amateur in regular expressions.
Thank you!
Notwithstanding that your JSON string is very slightly malformed (there's an extra comma after the last element in the array that should be fixed by whoever's generating the "JSON"), attempting to use regexps to handle this just means you now have two problems instead of one.
More specifically, objects within JSON are explicitly unordered sets of key/value pairs. It's perfectly possible that whatever's changing the JSON could be rewritten such that the JSON is semantically identical but serialised differently, making anything that relies on the current structure brittle and error prone.
Instead, use a proper JSON decoder, and then traverse the resulting object hierarchy directly to find the desired element:
use JSON;
use utf8;
# decode the JSON
my $obj = decode_json($jsonString);
# get the ITEM array
my $itemRef = $obj->{ITEM};
# find all elements matching the item name
my #match = grep { $_->{'-itemName'} eq $itemName } #{$itemRef};
# extract the item ID
if (#match) {
my $itemID = $match[0]->{'-itemID'};
print $itemID;
}
Don't use a regular expression to parse JSON. Use JSON.
Basically :
use strict;
use warnings;
use Data::Dumper;
use JSON;
my $json_string;
{
open( my $json_in, "<", 'test.json' );
local $/;
$json_string = <$json_in>;
}
my $json = decode_json ( $json_string );
print Dumper \$json;
foreach my $item ( #{ $json -> {'ITEM'} } ) {
print $item -> {'-itemID'},"\n";
print $item -> {'-itemName'},"\n";
}
But you have to fix your json first. (There's a trailing comma that shouldn't be there. )
JSON is a defined data transfer structure. Whilst you can technically treat it as 'plain text' and extract things from the text, that's definitively the wrong way to do things.
It might work fine for a good long time, but if your source program changes a little - and change their output, whilst still sticking to the JSON standard - your code will break unexpectedly, and you may not realise. That can set off a domino effect of breakages, making a whole system or site just crash and burn. And worse yet - the source of this crash and burn will be hidden away in some script that hasn't been touched in years, so will be very difficult to fix.
This is one of my pet peeves as a professional sysadmin. Please don't even go there.

Find match over Array of RegEx in MongoDB Collection

Say I have a collection with these fields:
{
"category" : "ONE",
"data": [
{
"regex": "/^[0-9]{2}$/",
"type" : "TYPE1"
},
{
"regex": "/^[a-z]{3}$/",
"type" : "TYPE2"
}
// etc
]
}
So my input is "abc" so I'd like to obtain the corresponding type (or best match, although initially I'm assuming RegExes are exclusive). Is there any possible way to achieve this with decent performance? (that would be excluding iterating over each item of the RegEx array)
Please note the schema could be re-arranged if possible, as this project is still in the design phase. So alternatives would be welcomed.
Each category can have around 100 - 150 RegExes. I plan to have around 300 categories.
But I do know that types are mutually exclusive.
Real world example for one category:
type1=^34[0-9]{4}$,
type2=^54[0-9]{4}$,
type3=^39[0-9]{4}$,
type4=^1[5-9]{2}$,
type5=^2[4-9]{2,3}$
Describing the RegEx (Divide et Impera) would greatly help in limiting the number of Documents needed to be processed.
Some ideas in this direction:
RegEx accepting length (fixed, min, max)
POSIX style character classes ([:alpha:], [:digit:], [:alnum:], etc.)
Tree like Document structure (umm)
Implementing each of these would add to the complexity (code and/or manual input) for Insertion and also some overhead for describing the searchterm before the query.
Having mutually exclusive types in a category simplifies things, but what about between categories?
300 categories # 100-150 RegExps/category => 30k to 45k RegExps
... some would surely be exact duplicates if not most of them.
In this approach I'll try to minimise the total number of Documents to be stored/queried in a reversed style vs. your initial proposed 'schema'.
Note: included only string lengths in this demo for narrowing, this may come naturally for manual input as it could reinforce a visual check over the RegEx
Consider rewiting the regexes Collection with Documents as follows:
{
"max_length": NumberLong(2),
"min_length": NumberLong(2),
"regex": "^[0-9][2]$",
"types": [
"ONE/TYPE1",
"NINE/TYPE6"
]
},
{
"max_length": NumberLong(4),
"min_length": NumberLong(3),
"regex": "^2[4-9][2,3]$",
"types": [
"ONE/TYPE5",
"TWO/TYPE2",
"SIX/TYPE8"
]
},
{
"max_length": NumberLong(6),
"min_length": NumberLong(6),
"regex": "^39[0-9][4]$",
"types": [
"ONE/TYPE3",
"SIX/TYPE2"
]
},
{
"max_length": NumberLong(3),
"min_length": NumberLong(3),
"regex": "^[a-z][3]$",
"types": [
"ONE/TYPE2"
]
}
.. each unique RegEx as it's own document, having Categories it belongs to (extensible to multiple types per category)
Demo Aggregation code:
function () {
match=null;
query='abc';
db.regexes.aggregate(
{$match: {
max_length: {$gte: query.length},
min_length: {$lte: query.length},
types: /^ONE\//
}
},
{$project: {
regex: 1,
types: 1,
_id:0
}
}
).result.some(function(re){
if (query.match(new RegExp(re.regex))) return match=re.types;
});
return match;
}
Return for 'abc' query:
[
"ONE/TYPE2"
]
this will run against only these two Documents:
{
"regex": "^2[4-9][2,3]$",
"types": [
"ONE/TYPE5",
"TWO/TYPE2",
"SIX/TYPE8"
]
},
{
"regex": "^[a-z][3]$",
"types": [
"ONE/TYPE2"
]
}
narrowed by the length 3 and having the category ONE.
Could be narrowed even further by implementing POSIX descriptors (easy to test against the searchterm but have to input 2 RegExps in the DB)
Breadth first search.
If your input starts with a letter you can throw away type 1, if it also contains a number you can throw away exclusive(numbers only or letters only) categories, and if it also contains a symbol then keep only a handful of types containing all three. Then follow above advice for remaining categories. In a sense, set up cases for input types and use cases for a select number of 'regex types' to search down to the right one.
Or you can create a regex model based on the input and compare it to the list of regex models existing as a string to get the type. That way you just have to spend resources analyzing the input to build the regex for it.

Error in regex hyphen usage inside brackets utilizing perl

I have this perl script that compares two arrays to give me back those results found in both of them. The problem arises I believe in a regular expression, where it encounters a hyphen ( - ) inside of brackets [].
I am getting the following error:
Invalid [] range "5-3" in regex; marked by <-- HERE in m/>gi|403163623|ref|XP_003323683.2| leucyl-tRNA synthetase [Puccinia graminis f. sp. tritici CRL 75-3 <-- HERE 6-700-3]
MAQSTPSSIQELMDKKQKEATLDMGGNFTKRDDLIRYEKEAQEKWANSNIFQTDSPYIENPELKDLSGEE
LREKYPKFFGTFPYPYMNGSLHLGHAFTISKIEFAVGFERMRGRRALFPVGWHATGMPIKSASDKIIREL
EQFGQDLSKFDSQSNPMIETNEDKSATEPTTASESQDKSKAKKGKIQAKSTGLQYQFQIMESIGVSRTDI
PKFADPQYWLQYFPPIAKNDLNAFGARVDWRRSFITTDINPYYDAFVRWQMNRLKEKGYVKFGERYTIYS
PKDGQPCMDHDRSSGERLGSQEYTCLKMKVLEWGPQAGDLAAKLGGKDVFFV at comparer line 21, <NUC> chunk 168.
I thought the error could be solved by just adding \Q..\E in the regex so as to bypass the [] but this has not worked. Here is my code, and thanks in advance for any and all help that you may offer.
#cyt = <CYT>;
#nuc = <NUC>;
$cyt = join ('',#cyt);
$cyt =~ /\[([^\]]+)\]/g;
#shared = '';
foreach $nuc (#nuc) {
if ($cyt =~ $nuc) {
push #shared, $nuc;
}
}
print #shared;
What I am trying to achieve with this code is compare two different lists loaded into the arrays #cyt and #nuc. I then compare the name in between the [] of one of the elements in list to to the name in [] of the other. All those finds are then pushed into #shared. Hope that clarifies it a bit.
Your question describes a set intersection, which is covered in the Perl FAQ.
How do I compute the difference of two arrays? How do I compute the intersection of two arrays?
Use a hash. Here's code to do both and more. It assumes that each
element is unique in a given array:
my (#union, #intersection, #difference);
my %count = ();
foreach my $element (#array1, #array2) { $count{$element}++ }
foreach my $element (keys %count) {
push #union, $element;
push #{ $count{$element} > 1 ? \#intersection : \#difference }, $element;
}
Note that this is the symmetric difference, that is, all elements in
either A or in B but not in both. Think of it as an xor operation.
Applying it to your problem gives the code below.
Factor out the common code to find the names in the data files. This sub assumes that
every [name] will be entirely contained within a given line rather than crossing a newline boundary
each line of input will contain at most one [name]
If these assumptions are invalid, please provide more representative samples of your inputs.
Note the use of the /x regex switch that tells the regex parser to ignore most whitespace in patterns. In the code below, this permits visual separation between the brackets that are delimiters and the brackets surrounding the character class that captures names.
sub extract_names {
my($fh) = #_;
my %name;
while (<$fh>) {
++$name{$1} if /\[ ([^\]]+) \]/x;
}
%name;
}
Your question uses old-fashioned typeglob filehandles. Note that the paramter extract_names expects is a filehandle. Convenient parameter passing is one of many benefits of indirect filehandles, such as those created below.
open my $cyt, "<", "cyt.dat" or die "$0: open: $!";
open my $nuc, "<", "nuc.dat" or die "$0: open: $!";
my %cyt = extract_names $cyt;
my %nuc = extract_names $nuc;
With the names from cyt.dat in the hash %cyt and likewise for nuc.dat and %nuc, the code here iterates over the keys of both hashes and increments the corresponding keys in %shared.
my %shared;
for (keys %cyt, keys %nuc) {
++$shared{$_};
}
At this point, %shared represents a set union of the names in cyt.dat and nuc.dat. That is, %shared contains all keys from either %cyt or %nuc. To compute the set difference, we observe that the value in %shared for a key present in both inputs must be greater than one.
The final pass below iterates over the keys in sorted order (because hash keys are stored internally in an undefined order). For truly shared keys (i.e., those whose values are greater than one), the code prints them and deletes the rest.
for (sort keys %shared) {
if ($shared{$_} > 1) {
print $_, "\n";
}
else {
delete $shared{$_};
}
}

Using a CouchDB view, can I count groups and filter by key range at the same time?

I'm using CouchDB. I'd like to be able to count occurrences of values of specific fields within a date range that can be specified at query time. I seem to be able to do parts of this, but I'm having trouble understanding the best way to pull it all together.
Assuming documents that have a timestamp field and another field, e.g.:
{ date: '20120101-1853', author: 'bart' }
{ date: '20120102-1850', author: 'homer'}
{ date: '20120103-2359', author: 'homer'}
{ date: '20120104-1200', author: 'lisa'}
{ date: '20120815-1250', author: 'lisa'}
I can easily create a view that filters documents by a flexible date range. This can be done with a view like the one below, called with key range parameters, e.g. _view/all-docs?startkey=20120101-0000&endkey=20120201-0000.
all-docs/map.js:
function(doc) {
emit(doc.date, doc);
}
With the data above, this would return a CouchDB view containing just the first 4 docs (the only docs in the date range).
I can also create a query that counts occurrences of a given field, like this, called with grouping, i.e. _view/author-count?group=true:
author-count/map.js:
function(doc) {
emit(doc.author, 1);
}
author-count/reduce.js:
function(keys, values, rereduce) {
return sum(values);
}
This would yield something like:
{
"rows": [
{"key":"bart","value":1},
{"key":"homer","value":2}
{"key":"lisa","value":2}
]
}
However, I can't find the best way to both filter by date and count occurrences. For example, with the data above, I'd like to be able to specify range parameters like startkey=20120101-0000&endkey=20120201-0000 and get a result like this, where the last doc is excluded from the count because it is outside the specified date range:
{
"rows": [
{"key":"bart","value":1},
{"key":"homer","value":2}
{"key":"lisa","value":1}
]
}
What's the most elegant way to do this? Is this achievable with a single query? Should I be using another CouchDB construct, or is a view sufficient for this?
You can get pretty close to the desired result with a list:
{
_id: "_design/authors",
views: {
authors_by_date: {
map: function(doc) {
emit(doc.date, doc.author);
}
}
},
lists: {
count_occurrences: function(head, req) {
start({ headers: { "Content-Type": "application/json" }});
var result = {};
var row;
while(row = getRow()) {
var val = row.value;
if(result[val]) result[val]++;
else result[val] = 1;
}
return result;
}
}
}
This design can be requested as such:
http://<couchurl>/<db>/_design/authors/_list/count_occurrences/authors_by_date?startkey=<startDate>&endkey=<endDate>
This will be slower than a normal map-reduce, and is a bit of a workaround. Unfortunately, this is the only way to do a multi-dimensional query, "which CouchDB isn’t suited for".
The result of requesting this design will be something like this:
{
"bart": 1,
"homer": 2,
"lisa": 2
}
What we do is basically emit a lot of elements, then using a list to group them as we want. A list can be used to display a result in any way you want, but will also often be slower. Whereas a normal map-reduce can be cached and only change according to the diffs, the list will have to be built anew every time it is requested.
It is pretty much as slow as getting all the elements resulting from the map (the overhead of orchestrating the data is mostly negligible): a lot slower than getting the result of a reduce.
If you want to use the list for a different view, you can simply exchange it in the URL you request:
http://<couchurl>/<db>/_design/authors/_list/count_occurrences/<view>
Read more about lists on the couchdb wiki.
You need to create a combined view:
combined/map.js:
function(doc) {
emit([doc.date, doc.author], 1);
}
combined/reduce.js:
_sum
This way you will be able to filter documents by start/end date.
startkey=[20120101-0000, "a"]&endkey=[20120201-0000, "a"]
Although your problem is hard to solve in general case, knowing some more restrictions on the possible queries can help a lot. E.g. if you know you will search on the ranges that will cover full days/months you can user the arrays of [year, month, day, time] instead of the string:
emit([doc.date_year, doc.date_month, doc.date_day, doc.date_time, doc.author] doc);
Even if you cannot predict that all possible queries will fit into grouping based on this key type, splitting the key may help you to optimize your range queries and decrease number of lookups needed (with the cost of some extra space).

Regex to calculate straight poker hand?

Is there a regex to calculate straight poker hand?
I'm using strings to represent the sorted cards, like:
AAAAK#sssss = 4 aces and a king, all of spades.
A2345#ddddd = straight flush, all of diamonds.
In Java, I'm using these regexes:
regexPair = Pattern.compile(".*(\\w)\\1.*#.*");
regexTwoPair = Pattern.compile(".*(\\w)\\1.*(\\w)\\2.*#.*");
regexThree = Pattern.compile(".*(\\w)\\1\\1.*#.*");
regexFour = Pattern.compile(".*(\\w)\\1{3}.*#.*");
regexFullHouse = Pattern.compile("((\\w)\\2\\2(\\w)\\3|(\\w)\\4(\\w)\\5\\5)#.*");
regexFlush = Pattern.compile(".*#(\\w)\\1{4}");
How to calculate straight (sequences) values with regex?
EDIT
I open another question to solve the same problem, but using ascii value of char,
to regex be short. Details here.
Thanks!
I have to admit that regular expressions are not the first tool I would have thought of for doing this. I can pretty much guarantee that any RE capable of doing that to an unsorted hand is going to be far more hideous and far less readable than the equivalent procedural code.
Assuming the cards are sorted by face value (and they seem to be otherwise your listed regexes wouldn't work either), and you must use a regex, you could use a construct like
2345A|23456|34567|...|9TJQK|TJQKA
to detect the face value part of the hand.
In fact, from what I gather here of the "standard" hands, the following should be checked in order of decreasing priority:
Royal/straight flush: "(2345A|23456|34567|...|9TJQK|TJQKA)#(\\w)\\1{4}"
Four of a kind: ".*(\\w)\\1{3}.*#.*"
Full house: "((\\w)\\2\\2(\\w)\\3|(\\w)\\4(\\w)\\5\\5)#.*"
Flush: ".*#(\\w)\\1{4}"
Straight: "(2345A|23456|34567|...|9TJQK|TJQKA)#.*"
Three of a kind: ".*(\\w)\\1\\1.*#.*"
Two pair: ".*(\\w)\\1.*(\\w)\\2.*#.*"
One pair: ".*(\\w)\\1.*#.*"
High card: (none)
Basically, those are the same as yours except I've added the royal/straight flush and the straight. Provided you check them in order, you should get the best score from the hand. There's no regex for the high card since, at that point, it's the only score you can have.
I also changed the steel wheel (wrap-around) straights from A2345 to 2345A since they'll be sorted that way.
I rewrote the regex for this because I found it frustrating and confusing. Groupings make much more sense for this type of logic. The sorting is being done using a standard array sort method in javascript hence the strange order of the cards, they are in alphabetic order. I did mine in javascript but the regex could be applied to java.
hands = [
{ regex: /(2345A|23456|34567|45678|56789|6789T|789JT|89JQT|9JKQT|AJKQT)#(.)\2{4}.*/g , name: 'Straight flush' },
{ regex: /(.)\1{3}.*#.*/g , name: 'Four of a kind' },
{ regex: /((.)\2{2}(.)\3{1}#.*|(.)\4{1}(.)\5{2}#.*)/g , name: 'Full house' },
{ regex: /.*#(.)\1{4}.*/g , name: 'Flush' },
{ regex: /(2345A|23456|34567|45678|56789|6789T|789JT|89JQT|9JKQT|AJKQT)#.*/g , name: 'Straight' },
{ regex: /(.)\1{2}.*#.*/g , name: 'Three of a kind' },
{ regex: /(.)\1{1}.*(.)\2{1}.*#.*/g , name: 'Two pair' },
{ regex: /(.)\1{1}.*#.*/g , name: 'One pair' },
];