Find match over Array of RegEx in MongoDB Collection

Say I have a collection with these fields:
{
    "category" : "ONE",
    "data": [
        {
            "regex": "/^[0-9]{2}$/",
            "type" : "TYPE1"
        },
        {
            "regex": "/^[a-z]{3}$/",
            "type" : "TYPE2"
        }
        // etc
    ]
}
My input is "abc", and I'd like to obtain the corresponding type (or the best match, although initially I'm assuming the RegExes are mutually exclusive). Is there any way to achieve this with decent performance, that is, without iterating over each item of the RegEx array?
Please note the schema could be re-arranged if possible, as this project is still in the design phase. So alternatives would be welcomed.
Each category can have around 100 - 150 RegExes. I plan to have around 300 categories.
But I do know that types are mutually exclusive.
Real world example for one category:
type1=^34[0-9]{4}$,
type2=^54[0-9]{4}$,
type3=^39[0-9]{4}$,
type4=^1[5-9]{2}$,
type5=^2[4-9]{2,3}$

Describing the RegEx (Divide et Impera) would greatly help in limiting the number of Documents needed to be processed.
Some ideas in this direction:
RegEx accepting length (fixed, min, max)
POSIX style character classes ([:alpha:], [:digit:], [:alnum:], etc.)
Tree like Document structure (umm)
Implementing each of these would add to the complexity (code and/or manual input) for Insertion and also some overhead for describing the searchterm before the query.
Having mutually exclusive types in a category simplifies things, but what about between categories?
300 categories x 100-150 RegExps/category => 30k to 45k RegExps
... some would surely be exact duplicates if not most of them.
In this approach I'll try to minimise the total number of Documents to be stored/queried in a reversed style vs. your initial proposed 'schema'.
Note: I included only string lengths in this demo for narrowing; this may come naturally for manual input, as it could reinforce a visual check of the RegEx.
Consider rewriting the regexes Collection with Documents as follows:
{
    "max_length": NumberLong(2),
    "min_length": NumberLong(2),
    "regex": "^[0-9]{2}$",
    "types": [
        "ONE/TYPE1",
        "NINE/TYPE6"
    ]
},
{
    "max_length": NumberLong(4),
    "min_length": NumberLong(3),
    "regex": "^2[4-9]{2,3}$",
    "types": [
        "ONE/TYPE5",
        "TWO/TYPE2",
        "SIX/TYPE8"
    ]
},
{
    "max_length": NumberLong(6),
    "min_length": NumberLong(6),
    "regex": "^39[0-9]{4}$",
    "types": [
        "ONE/TYPE3",
        "SIX/TYPE2"
    ]
},
{
    "max_length": NumberLong(3),
    "min_length": NumberLong(3),
    "regex": "^[a-z]{3}$",
    "types": [
        "ONE/TYPE2"
    ]
}
... each unique RegEx is stored as its own Document, listing the Categories it belongs to (extensible to multiple types per category).
Demo Aggregation code:
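function () {
    var match = null;
    var query = 'abc';
    // Narrow by length and category in the database, then test the remaining RegExes client-side.
    // aggregate() returns a cursor in current shells; shells before 2.6 exposed the array as .result instead.
    db.regexes.aggregate([
        {$match: {
            max_length: {$gte: query.length},
            min_length: {$lte: query.length},
            types: /^ONE\//
        }},
        {$project: {
            regex: 1,
            types: 1,
            _id: 0
        }}
    ]).toArray().some(function (re) {
        // Stop at the first RegEx that matches and remember its types.
        if (new RegExp(re.regex).test(query)) {
            match = re.types;
            return true;
        }
        return false;
    });
    return match;
}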
Return for 'abc' query:
[
"ONE/TYPE2"
]
this will run against only these two Documents:
{
    "regex": "^2[4-9]{2,3}$",
    "types": [
        "ONE/TYPE5",
        "TWO/TYPE2",
        "SIX/TYPE8"
    ]
},
{
    "regex": "^[a-z]{3}$",
    "types": [
        "ONE/TYPE2"
    ]
}
narrowed by the length 3 and having the category ONE.
This could be narrowed even further by implementing POSIX descriptors (easy to test against the search term, but it means storing two RegExps per Document in the DB).
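A rough sketch of that extra narrowing step, done in memory so it can be run without a database. The `classes` field, the `describeTerm` helper and the class names are assumptions made for illustration; they are not part of the schema above, and a real setup would put the same conditions into the $match stage.

```javascript
// In-memory stand-in for the regexes Collection. `classes` is a hypothetical field
// listing the POSIX-style classes a search term may fall into for this RegEx to match.
const regexes = [
  { min_length: 2, max_length: 2, regex: '^[0-9]{2}$', classes: ['digit'], types: ['ONE/TYPE1', 'NINE/TYPE6'] },
  { min_length: 3, max_length: 4, regex: '^2[4-9]{2,3}$', classes: ['digit'], types: ['ONE/TYPE5', 'TWO/TYPE2', 'SIX/TYPE8'] },
  { min_length: 6, max_length: 6, regex: '^39[0-9]{4}$', classes: ['digit'], types: ['ONE/TYPE3', 'SIX/TYPE2'] },
  { min_length: 3, max_length: 3, regex: '^[a-z]{3}$', classes: ['alpha'], types: ['ONE/TYPE2'] }
];

// Describe the search term by the narrowest class that covers all of its characters.
function describeTerm(term) {
  if (/^[0-9]+$/.test(term)) return 'digit';
  if (/^[A-Za-z]+$/.test(term)) return 'alpha';
  if (/^[A-Za-z0-9]+$/.test(term)) return 'alnum';
  return 'other';
}

// Narrow by category, length and class first; only then run the surviving RegExes.
function findTypes(term, category) {
  const cls = describeTerm(term);
  const candidates = regexes.filter(function (doc) {
    return doc.min_length <= term.length &&
      doc.max_length >= term.length &&
      doc.classes.indexOf(cls) !== -1 &&
      doc.types.some(function (t) { return t.indexOf(category + '/') === 0; });
  });
  const hit = candidates.find(function (doc) { return new RegExp(doc.regex).test(term); });
  return { tested: candidates.length, types: hit ? hit.types : null };
}

console.log(findTypes('abc', 'ONE'));
```

For 'abc' in category ONE, the class filter drops the digits-only Document that the length filter alone would keep, so only one RegEx is actually executed. A RegEx that accepts mixed input would simply list several classes.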

Breadth first search.
If your input starts with a letter, you can throw away type 1. If it also contains a number, you can throw away the exclusive (numbers-only or letters-only) categories, and if it also contains a symbol, keep only the handful of types that contain all three. Then follow the advice above for the remaining categories. In a sense, set up cases for the kinds of input and use them to select a small number of 'regex types' to search down to the right one.
Or you can create a regex model based on the input and compare it to the list of regex models existing as a string to get the type. That way you just have to spend resources analyzing the input to build the regex for it.
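One possible reading of the "regex model" idea, as a minimal sketch. It assumes the stored RegExes are written in one canonical form (a run of a character class followed by an exact count), which is an assumption of this example; patterns with literal prefixes such as ^34[0-9]{4}$ or ranges such as {2,3} would need a richer model. The names `classOf`, `modelOf` and `typeByModel` are hypothetical.

```javascript
// Map one character to the canonical class assumed for the stored RegExes.
function classOf(ch) {
  if (ch >= '0' && ch <= '9') return '[0-9]';
  if (ch >= 'a' && ch <= 'z') return '[a-z]';
  if (ch >= 'A' && ch <= 'Z') return '[A-Z]';
  return '.';
}

// Build a model string for the input by collapsing runs of the same class.
function modelOf(input) {
  let model = '';
  let i = 0;
  while (i < input.length) {
    const cls = classOf(input[i]);
    let j = i;
    while (j < input.length && classOf(input[j]) === cls) j++;
    model += cls + '{' + (j - i) + '}';
    i = j;
  }
  return '^' + model + '$';
}

// Stored models keyed by their string form, so the lookup is a plain string comparison.
const typeByModel = new Map([
  ['^[0-9]{2}$', 'TYPE1'],
  ['^[a-z]{3}$', 'TYPE2']
]);

function typeOf(input) {
  return typeByModel.get(modelOf(input)) || null;
}

console.log(modelOf('abc'), typeOf('abc'));
```

The work is spent on analysing the input once; no stored RegEx is executed at all, which is the trade-off this answer describes.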

Related

VS Code: How to convert snippet placeholder from camelCase to SCREAMING_SNAKE_CASE?

I'd like to create a VS Code snippet for creating redux reducers.
I would like to have a snippet with placeholder that expects camelCase and then transform a matching placeholder to SCREAMING_SNAKE_CASE.
Here's my attempted snippet, which is not working:
"test": {
"prefix": "test",
"body": "${1} -> ${1/([a-zA-Z])(?=[A-Z])/${1:/upcase}_/g}"
},
Which produces a non-desired result:
changeNetworkStatus -> changE_NetworK_Status
Desired Flow
type test (name of snippet)
hit tab to load the snippet.
type changeNetworkStatus to result in:
changeNetworkStatus -> changeNetworkStatus
hit tab to get expected result of:
changeNetworkStatus -> CHANGE_NETWORK_STATUS
How can I change my snippet code to get the desired result?
Here's a related solution which requires a different flow.
If you are starting with a non-camelCase input and want to get to SCREAMING_SNAKE_CASE, see https://stackoverflow.com/a/67008397/836330. The method there can handle input with spaces and hyphens.
Update: Keybinding version:
VS Code is adding the editor.action.transformToSnakecase command in v1.53, so the requested operation can be done more easily, without having to figure out the necessary regex as shown in the previous answer. I'm including it because some people might find this question while looking for snake case (snake-case) information.
What I show now is NOT a snippet however. You just type your text and then trigger the keybinding. The keybinding itself fires a macro extension command from the multi-command extension. In keybindings.json:
{
"key": "alt+3", // whatever keybinding you wish
"command": "extension.multiCommand.execute",
"args": {
"sequence": [
"cursorWordLeftSelect", // select word you just typed
"editor.action.transformToSnakecase",
"editor.action.transformToUppercase",
// "cursorLineEnd" // if you want this
]
},
"when": "editorTextFocus && !editorHasSelection"
},
Snippet version:
"camelCaseModify": {
"prefix": "test",
"body": [
// first inefficient try, works for up to three words
// "${1} -> ${1/^([a-z]*)([A-Z])([a-z]+)*([A-Z])*([a-z]+)*/${1:/upcase}_$2${3:/upcase}${4:+_}$4${5:/upcase}/g}"
"${1} -> ${1/([a-z]*)(([A-Z])+([a-z]+))?/${1:/upcase}${2:+_}$3${4:/upcase}/g}",
// here is an especially gnarly version to handle edge cases like 'thisISABCTest' and trailing _'s
"${1} -> ${1/([a-z]+)(?=[A-Z])|([A-Z])(?=[A-Z])|([A-Z][a-z]+)(?=$)|([A-Z][a-z]+)|([a-z]+)(?=$)/${1:/upcase}${1:+_}$2${2:+_}${3:/upcase}${4:/upcase}${4:+_}${5:/upcase}/g}"
],
"description": "underscore separators"
},
This works with any number of camelCase words, from one to infinity...
The ${2:+_} means "if there is a capture group 2 then append an underscore." If there isn't a second word/capture group then groups 3 and 4 will be empty anyway because they are within capture group 2. Capture Group 2 is always the next Word (that starts with one capital and followed by at least one small letter).
for example, using changeNetworkStatus:
Match 1
Full match 0-13 `changeNetwork`
Group 1. 0-6 `change`
Group 2. 6-13 `Network`
Group 3. 6-7 `N`
Group 4. 7-13 `etwork`
Match 2
Full match 13-19 `Status`
Group 1. 13-13 ``
Group 2. 13-19 `Status`
Group 3. 13-14 `S`
Group 4. 14-19 `tatus`
Match 3
Full match 19-19 ``
Group 1. 19-19 ``
Sample Output:
abcd -> ABCD
twoFish -> TWO_FISH
threeFishMore -> THREE_FISH_MORE
fourFishOneMore -> FOUR_FISH_ONE_MORE
fiveFishTwoMoreFish -> FIVE_FISH_TWO_MORE_FISH
sixFishEelsSnakesDogsCatsMiceRatsClocksRocks -> SIX_FISH_EELS_SNAKES_DOGS_CATS_MICE_RATS_CLOCKS_ROCKS
Using regex101.com really helps to visualize what is going on!
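The same substitution can also be tried outside the editor. The sketch below applies the pattern from the first snippet body with String.prototype.replace and mimics the format string by hand; the function name `toScreamingSnake` is made up for this example, and it assumes plain camelCase input like the samples above.

```javascript
// Same pattern as the first snippet body.
const pattern = /([a-z]*)(([A-Z])+([a-z]+))?/g;

// Mimics the format string ${1:/upcase}${2:+_}$3${4:/upcase}.
function toScreamingSnake(input) {
  return input.replace(pattern, function (match, g1, g2, g3, g4) {
    return g1.toUpperCase() +      // ${1:/upcase}
      (g2 ? '_' : '') +            // ${2:+_} : underscore only if group 2 matched
      (g3 || '') +                 // $3 is already a capital letter
      (g4 || '').toUpperCase();    // ${4:/upcase}
  });
}

console.log(toScreamingSnake('changeNetworkStatus'));
```

Unmatched groups arrive as undefined in the callback, which is why groups 2 to 4 are guarded; the trailing empty match at the end of the string contributes nothing to the output.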

How to split string in MongoDB?

The example data is as following:
{"BrandId":"a","Method":"PUT","Url":"/random/widgets/random/state"}
{"BrandId":"a","Method":"POST","Url":"/random/collection/random/state"}
{"BrandId":"b","Method":"PUT","Url":"/random/widgets/random/state"}
{"BrandId":"b","Method":"PUT","Url":"/random/widgets/random/state"}
I need to find all the rows with Method=PUT and a Url matching the pattern /random/widgets/random/state, where "random" is a random string with a fixed length. The expected result is:
{"BrandId":"a","total":1}
{"BrandId":"b","total":2}
I tried to write some code as:
db.accessLog.aggregate([
{$group: {
_id: '$BrandId',
total: {
$sum:{
$cond:[{$and: [ {$eq: ['$Method', 'POST']},
{Url:{$regex: /.*\/widgets.*\/state$/}} ]}, 1, 0]
}
},
{$group: {
_id: '$_id',
total:{$sum:'$total'}
}
])
But the regular expression does not work, so I suppose I need to try another way to do it, perhaps splitting the string. I also need to use $cond, so please keep it. Thanks!
You can use the following query to achieve what you want. I assume the data is in a collection named 'products':
db.products.aggregate([
{$match : {'Method':'PUT','Url':/.*widgets.*\/state$/ }},
{$group: {'_id':'$BrandId','total':{$sum: 1} }}
]);
1. $match: Find all documents that have the 'PUT' method and a Url matching the specified pattern.
2. $group: Group by BrandId and count 1 for each entry.
Greedy matching is the problem.
Assuming non-zero number of 'random' characters (sounds sensible), try a regex of:
/[^\/]+\/widgets\/[^\/]+\/state$/
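To see how this pattern behaves on the sample rows, here is a small in-memory sketch in plain JavaScript. It is not a MongoDB query: the `countByBrand` helper is hypothetical and only imitates a conditional sum per BrandId, adding 1 for a matching row and 0 otherwise.

```javascript
// Sample rows from the question.
const accessLog = [
  { BrandId: 'a', Method: 'PUT', Url: '/random/widgets/random/state' },
  { BrandId: 'a', Method: 'POST', Url: '/random/collection/random/state' },
  { BrandId: 'b', Method: 'PUT', Url: '/random/widgets/random/state' },
  { BrandId: 'b', Method: 'PUT', Url: '/random/widgets/random/state' }
];

// Each [^\/]+ stands for exactly one non-empty path segment.
const urlPattern = /[^\/]+\/widgets\/[^\/]+\/state$/;

// Conditional sum per brand: 1 for a PUT on a matching Url, 0 otherwise.
function countByBrand(rows) {
  const totals = {};
  rows.forEach(function (row) {
    const hit = row.Method === 'PUT' && urlPattern.test(row.Url);
    totals[row.BrandId] = (totals[row.BrandId] || 0) + (hit ? 1 : 0);
  });
  return totals;
}

console.log(countByBrand(accessLog));
```

Unlike a pattern built from .*, this one rejects a Url such as /x/widgets/a/b/state, because only a single segment is allowed between widgets and state.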

How to use OR operator in Freebase?

Case:
I'm using the OR (ONE OF) operator to get people from either of two countries.
The query looks like:
[{
"id": null,
"type": "/people/person",
"/people/person/nationality": {
"name|=": [
"Jordan",
"Ottoman Empire"
]
},
"name": null,
"limit": 30
}]
The query works fine, but it stops working if you increase the limit to 40, for example. The error returned is "Unique query may have at most one result. Got 2". This means that there exists a person with both nationalities, "Jordan" and "Ottoman Empire".
Question:
It makes sense for a "ONE OF" operator, but not for "OR" operator. Is there any operator in Freebase that can query "ANY OF" or true "OR" to cover these cases?
You're getting the error because you used object notation ({}), which expects a single result, in a place where you're returning two results and would therefore need an array ([]).
Having said that, I think what you really need to do is hoist your |= operator up a level to /people/person/nationality. Note also that you need array notation even if just asking for nationality results for a person, because it's multi-valued (e.g. Sirhan Sirhan has both Jordan and Mandatory Palestine as his nationality).
Here's a query that will do what you want (although you should really use IDs for the countries rather than their English labels):
[{
"id": null,
"name": null,
"nationality": [],
"type": "/people/person",
"nationality|=": [
"Jordan",
"Ottoman Empire"
]
}]

Can use of $regex in MongoDB Aggregation Framework be optimized?

I'm currently using the Aggregation Framework to calculate some aggregate values on a collection. The typical query/operation might include close to 1 million documents that are indexed (the f1/ts match reduces the number of documents down to the 1 million from about 5 million, the set can be made smaller depending on the query parameters, e.g. timeframe selected).
The $match uses indexed attributes of the documents, but I need to include what amounts to a full-text search. The only option that I've used is a $regex match. Unfortunately, I can't anchor my $regex match to the start of the string, since the string(s) I'm looking for could be anywhere. The text I'm "searching" could be anywhere from a few characters to a few thousand in length.
Just running some basic comparisons, the inclusion of the $regex match attribute nearly doubles the time the calculation takes to complete.
What are my options for optimizing this operation?
Is it possible to achieve this some other way?
Will text-search be available for use in aggregation operations in v2.6?
For reference:
Operation without $regex
db.my_collection.aggregate(
{
$match:{
f1:ObjectId('417abd81...577000006'),
ts:{$gte:t1,$lte:t2}
}
},{
$project:{
ts:1,
p:1
}
},{
$group:{
_id:"$ts",
x: {
$sum:"$p"
}
}
});
Operation with $regex
db.my_collection.aggregate(
{
$match:{
f1:ObjectId('417abd81...577000006'),
ts:{$gte:t1,$lte:t2}
}
},
{
$match:{
c:{$regex: /somevalue/i}
}
},{
$project:{
ts:1,
p:1
}
},{
$group:{
_id:"$ts",
x: {
$sum:"$p"
}
}
});

Regex to calculate straight poker hand?

Is there a regex to calculate straight poker hand?
I'm using strings to represent the sorted cards, like:
AAAAK#sssss = 4 aces and a king, all of spades.
A2345#ddddd = straight flush, all of diamonds.
In Java, I'm using these regexes:
regexPair = Pattern.compile(".*(\\w)\\1.*#.*");
regexTwoPair = Pattern.compile(".*(\\w)\\1.*(\\w)\\2.*#.*");
regexThree = Pattern.compile(".*(\\w)\\1\\1.*#.*");
regexFour = Pattern.compile(".*(\\w)\\1{3}.*#.*");
regexFullHouse = Pattern.compile("((\\w)\\2\\2(\\w)\\3|(\\w)\\4(\\w)\\5\\5)#.*");
regexFlush = Pattern.compile(".*#(\\w)\\1{4}");
How to calculate straight (sequences) values with regex?
EDIT
I opened another question to solve the same problem using the ASCII values of the characters, to keep the regex short. Details here.
Thanks!
I have to admit that regular expressions are not the first tool I would have thought of for doing this. I can pretty much guarantee that any RE capable of doing that to an unsorted hand is going to be far more hideous and far less readable than the equivalent procedural code.
Assuming the cards are sorted by face value (and they seem to be, otherwise your listed regexes wouldn't work either), and you must use a regex, you could use a construct like
2345A|23456|34567|...|9TJQK|TJQKA
to detect the face value part of the hand.
In fact, from what I gather here of the "standard" hands, the following should be checked in order of decreasing priority:
Royal/straight flush: "(2345A|23456|34567|...|9TJQK|TJQKA)#(\\w)\\1{4}"
Four of a kind: ".*(\\w)\\1{3}.*#.*"
Full house: "((\\w)\\2\\2(\\w)\\3|(\\w)\\4(\\w)\\5\\5)#.*"
Flush: ".*#(\\w)\\1{4}"
Straight: "(2345A|23456|34567|...|9TJQK|TJQKA)#.*"
Three of a kind: ".*(\\w)\\1\\1.*#.*"
Two pair: ".*(\\w)\\1.*(\\w)\\2.*#.*"
One pair: ".*(\\w)\\1.*#.*"
High card: (none)
Basically, those are the same as yours except I've added the royal/straight flush and the straight. Provided you check them in order, you should get the best score from the hand. There's no regex for the high card since, at that point, it's the only score you can have.
I also changed the steel wheel (wrap-around) straights from A2345 to 2345A since they'll be sorted that way.
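The list of straights does not have to be typed by hand. As a sketch, assuming the rank order 23456789TJQKA with the ace sorted last (the `RANKS` constant and `straightPatterns` helper are names invented here), the alternation can be generated by sliding a five-card window over the rank string and adding the wheel separately. The leading ^ is an addition of this example.

```javascript
// Ranks in ascending face-value order with the ace last (assumed sort order of the hand string).
const RANKS = '23456789TJQKA';

// All five-card runs, plus the wheel, which sorts as 2345A when the ace is last.
function straightPatterns() {
  const runs = ['2345A'];
  for (let i = 0; i + 5 <= RANKS.length; i++) {
    runs.push(RANKS.slice(i, i + 5));
  }
  return runs;
}

// Straight: one of the runs, followed by the # separator and any suits.
const straight = new RegExp('^(' + straightPatterns().join('|') + ')#');

console.log(straightPatterns().join('|'));
console.log(straight.test('TJQKA#sssss'));
```

The same generated string can be pasted into the Java patterns above in place of the hand-written alternation.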
I rewrote the regexes for this because I found the originals frustrating and confusing. Groupings make much more sense for this type of logic. The sorting is done using the standard array sort method in JavaScript, hence the strange order of the cards: they are in alphabetic order. I did mine in JavaScript, but the regexes could be applied to Java.
hands = [
{ regex: /(2345A|23456|34567|45678|56789|6789T|789JT|89JQT|9JKQT|AJKQT)#(.)\2{4}.*/g , name: 'Straight flush' },
{ regex: /(.)\1{3}.*#.*/g , name: 'Four of a kind' },
{ regex: /((.)\2{2}(.)\3{1}#.*|(.)\4{1}(.)\5{2}#.*)/g , name: 'Full house' },
{ regex: /.*#(.)\1{4}.*/g , name: 'Flush' },
{ regex: /(2345A|23456|34567|45678|56789|6789T|789JT|89JQT|9JKQT|AJKQT)#.*/g , name: 'Straight' },
{ regex: /(.)\1{2}.*#.*/g , name: 'Three of a kind' },
{ regex: /(.)\1{1}.*(.)\2{1}.*#.*/g , name: 'Two pair' },
{ regex: /(.)\1{1}.*#.*/g , name: 'One pair' },
];
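A possible way to use such a table, sketched under a few assumptions: cards arrive as two-character strings like 'As' (rank, then suit), the `encode` and `classify` helpers are names made up here, and the patterns are repeated without the g flag because a global regex used with test() remembers lastIndex between calls.

```javascript
// Same patterns as above, without the g flag so that test() is stateless.
const handTable = [
  { regex: /(2345A|23456|34567|45678|56789|6789T|789JT|89JQT|9JKQT|AJKQT)#(.)\2{4}.*/, name: 'Straight flush' },
  { regex: /(.)\1{3}.*#.*/, name: 'Four of a kind' },
  { regex: /((.)\2{2}(.)\3{1}#.*|(.)\4{1}(.)\5{2}#.*)/, name: 'Full house' },
  { regex: /.*#(.)\1{4}.*/, name: 'Flush' },
  { regex: /(2345A|23456|34567|45678|56789|6789T|789JT|89JQT|9JKQT|AJKQT)#.*/, name: 'Straight' },
  { regex: /(.)\1{2}.*#.*/, name: 'Three of a kind' },
  { regex: /(.)\1{1}.*(.)\2{1}.*#.*/, name: 'Two pair' },
  { regex: /(.)\1{1}.*#.*/, name: 'One pair' }
];

// Build the 'ranks#suits' string; the default sort is alphabetic, so T lands after Q.
function encode(cards) {
  const sorted = cards.slice().sort();
  return sorted.map(c => c[0]).join('') + '#' + sorted.map(c => c[1]).join('');
}

// The table is ordered from best to worst, so the first match is the best score.
function classify(cards) {
  const hand = encode(cards);
  for (const entry of handTable) {
    if (entry.regex.test(hand)) return entry.name;
  }
  return 'High card';
}

console.log(encode(['Td', 'Jd', 'Qd', 'Kd', 'Ad']), classify(['Td', 'Jd', 'Qd', 'Kd', 'Ad']));
```

Sorting the two-character strings keeps each suit aligned with its rank, so the suit part of the encoded hand follows the same order as the rank part.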