How to split string in MongoDB? - regex

The example data is as follows:
{"BrandId":"a","Method":"PUT","Url":"/random/widgets/random/state"}
{"BrandId":"a","Method":"POST","Url":"/random/collection/random/state"}
{"BrandId":"b","Method":"PUT","Url":"/random/widgets/random/state"}
{"BrandId":"b","Method":"PUT","Url":"/random/widgets/random/state"}
I need to find all the rows where Method is PUT and Url matches the pattern /random/widgets/random/state, where "random" is a random string of fixed length. The expected result is:
{"BrandId":"a","total":1}
{"BrandId":"b","total":2}
I tried to write some code like this:
db.accessLog.aggregate([
  {$group: {
    _id: '$BrandId',
    total: {
      $sum: {
        $cond: [{$and: [{$eq: ['$Method', 'POST']},
                        {Url: {$regex: /.*\/widgets.*\/state$/}}]}, 1, 0]
      }
    }
  }},
  {$group: {
    _id: '$_id',
    total: {$sum: '$total'}
  }}
])
but the regular expression does not work, so I suppose I need to try another way to do it, perhaps splitting the string. Also, I need to use $cond, please keep it. Thanks!

You can use the following query to achieve what you want. I assume the data is in a collection named 'products':
db.products.aggregate([
{$match : {'Method':'PUT','Url':/.*widgets.*\/state$/ }},
{$group: {'_id':'$BrandId','total':{$sum: 1} }}
]);
1. $match:
Find all documents that have the 'PUT' method and a Url matching the specified pattern.
2. $group: Group by BrandId and count 1 for each matching document.

Greedy matching is the problem.
Assuming a non-zero number of 'random' characters (which sounds sensible), try a regex of:
/[^\/]+\/widgets\/[^\/]+\/state$/
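If you do need to keep $cond as the question asks, one option is to move the pattern test into the expression with $regexMatch (available from MongoDB 4.2). Below is a minimal sketch using pymongo; the database name and wrapper code are assumptions, and only BrandId, Method, Url, and the accessLog collection come from the question.
from pymongo import MongoClient

db = MongoClient()["mydb"]  # assumed database name

pipeline = [
    {"$group": {
        "_id": "$BrandId",
        "total": {"$sum": {
            "$cond": [
                {"$and": [
                    {"$eq": ["$Method", "PUT"]},
                    # $regexMatch (MongoDB 4.2+) tests the Url inside the expression
                    {"$regexMatch": {"input": "$Url",
                                     "regex": "^/[^/]+/widgets/[^/]+/state$"}},
                ]},
                1, 0,
            ]
        }},
    }},
]

for doc in db.accessLog.aggregate(pipeline):
    print(doc)  # e.g. {'_id': 'a', 'total': 1}, {'_id': 'b', 'total': 2}
On older servers without $regexMatch, the $match + $group approach above is the practical route.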

Related

Fluentd Parsing

Hi, I'm trying to parse a single-line log using Fluentd. Here is the log I'm trying to parse:
F2:4200000000000000,F3:000000,F4:000000060000,F6:000000000000,F7:000000000,F8..........etc
It should be parsed into something like this:
{ "F2" : "4200000000000000", "F3" : "000000", "F4" : "000000060000" ............etc }
I tried to use regex, but it's confusing and makes me write multiple regexes for different keys and values. Is there an easier way to achieve this?
EDIT 1: Hey! I will make this more detailed. I'm currently tailing logs with Fluentd and sending them to Elasticsearch+Kibana. Here is an unparsed example log that Fluentd sends to Elasticsearch:
21/09/02 16:36:09.927238: 1 frSMS:0:13995:#HTF4J::141:141:msg0210,00000000000000000,000000,000000,007232,00,#,F2:00000000000000000,F3:002000,F4:000000820000,F6:Random message and strings,F7:.......etc
Elasticsearch received this message:
{"message":"frSMS:0:13995:#HTF4J::141:141:msg0210,00000000000000000,000000,000000,007232,00,#,F2:00000000000000000,F3:002000,F4:000000820000,F6:Random
digits and chars,F7:.......etc"}
This log has only a message key, so I can't index it or build a dashboard using just the whole message field. What I'm trying to achieve is to capture only the useful fields, add a key where a value has no key, and make indexing easier.
Expected output:
{"logdate" : "21/09/02 16:36:09.927238",
"source" : "frSMS",
"UID" : "#HTF4J",
"statuscode" : "msg0210",
"F2": "00000000000000000",
"F3": "randomchar314516",.....}
I used the regex plugin to parse it into this, but it was too overwhelming. Here is what I did so far:
^(?<logDate>\d{2}.\d{2}.\d{2}\s\d{2}:\d{2}:\d{2}.\d{6}\b)....(?<source>fr[A-Z]{3,4}|to[A-Z]{3,4}\b).(?<status>\d\b).(?<dummyfield>\d{5}\b).(?<HUID>.[A-Z]{5}\b)..(?<d1>\d{3}\b).(?<d2>\d{3}\b).(?<msgcode>msg\d{4}\b).(?<dummyfield1>\d{16}\b).(?<dummyfield2>\d{6}\b).(?<dummyfield3>\d{6,7}\b).(?<dummyfield4>\d{6}\b).(?<dummyfield5>\d{2}\b)...
Which results in:
"logDate": "21/09/02 16:36:09.205706",
"source": "toSMS" ,
"status": "0",
"dummyfield": "13995" ,
"UID" : "#HTFAA" ,
"d1" : "156" ,
"d2" : "156" ,
"msgcode" : "msg0210",
"dummyfield1" :"0000000000000000" ,
"dummyfield2" :"002000",
"dummyfield3" :"2000000",
"dummyfield4" :"00",
"dummyfield5" :"2000000" ,
"dummyfield6" :"867202"
This only applies to the example log and contains useless fields like field1, dummyfield, dummyfield1, etc.
Other logs have the useful values and keys (date, source, msgcode, UID, F1, F2 fields) as I showed in the expected output. The non-useful fields are not static (they can be absent, or have fewer or more digits and characters), so they trigger a "pattern not matched" error.
So the questions are:
How do I capture the useful fields I mentioned, using regex?
How do I capture the F1, F2, F3, ... fields, which have different value patterns (e.g. mixed characters and strings)?
PS: I wrapped the regex I wrote in an HTML snippet so the <> capturing fields wouldn't get deleted.
Regex pattern to use:
(F[\d]+):([\d]+)
This pattern will catch all the 'F' keys with whatever digits come after; yes, even if it's F105 it still works. The whole 'F105' will be stored as the first group of your regex match.
The right part of the pattern will capture the digits following ':' up until any character that is not a digit (i.e. ',', 'F', etc.) and will store them as the second group of your regex match.
Usage
Depending on your programming language, you will have to iterate over the regex matches and extract group 1 and group 2 respectively.
Python example:
import re
log = 'F2:4200000000000000,F3:000000,F4:000000060000,F6:000000000000,F7:000000000,F105:9726450'
pattern = r'(F[\d]+):([\d]+)'
matches = re.finditer(pattern, log)
log_dict = {}
for match in matches:
    log_dict[match.group(1)] = match.group(2)
print(log_dict)
Output
{'F2': '4200000000000000', 'F3': '000000', 'F4': '000000060000', 'F6': '000000000000', 'F7': '000000000', 'F105': '9726450'}
Assuming the logdate will be static (pattern-wise), you can skip the useless values with ".+" and collect the useful values by their patterns. So the regex will be like this:
(?<logdate>\d{2}.\d{2}.\d{2}\s\d{2}:\d{2}:\d{2}.\d{6}\b).+(?<source>fr[A-Z]{3,4}|to[A-Z]{3,4}).+(?<UID>#[A-Z0-9]{5}).+(?<statuscode>msg\d{4})
And the output will be like:
{"logdate" : "21/09/02 16:36:09.927238", "source" : "frSMS",
"UID" : "#HTF4J","statuscode" : "msg0210"}
I'm still working on getting the F2, F3, ..., FN keys and values.
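For completeness, here is a minimal Python sketch (not a Fluentd config) that combines the header regex above with the (F\d+):(\d+) pattern from the first answer. Python's re module uses (?P<name>...) where Fluentd uses (?<name>...), and the log line is a shortened version of the example from the question.
import re

log = ("21/09/02 16:36:09.927238: 1 frSMS:0:13995:#HTF4J::141:141:msg0210,"
       "00000000000000000,000000,000000,007232,00,#,"
       "F2:00000000000000000,F3:002000,F4:000000820000")

# Named groups for the useful header fields.
header_re = re.compile(
    r'(?P<logdate>\d{2}/\d{2}/\d{2}\s\d{2}:\d{2}:\d{2}\.\d{6})'
    r'.+?(?P<source>(?:fr|to)[A-Z]{3,4})'
    r'.+?(?P<UID>#[A-Z0-9]{5})'
    r'.+?(?P<statuscode>msg\d{4})')
# Every F<digits>:<digits> pair, wherever it appears.
field_re = re.compile(r'(F\d+):(\d+)')

parsed = {}
m = header_re.search(log)
if m:
    parsed.update(m.groupdict())            # logdate, source, UID, statuscode
parsed.update(dict(field_re.findall(log)))  # F2, F3, F4, ...
print(parsed)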

MongoDB query with special characters in key

In my case, I have keys in my MongoDB database that contain a dot in their name (see attached screenshot). I have read that it is possible to store data in MongoDB this way, but the driver prevents queries with dots in the key. Anyway, in my MongoDB database, keys do contain dots and I have to work with them.
I have now tried to encode the dots in the query (. to \u002e) but it did not seem to work. Then I had the idea to work with regex to replace the dots in the query with any character but regex seems to only work for the value and not for the key.
Does anyone have a creative idea how I can get around this problem? For example, I want to have all the CVE numbers for 'cve_results.BusyBox 1.12.1'.
Update #1:
The structure of cve_results is as follows:
"cve_results" : {
"BusyBox 1.12.1" : {
"CVE-2018-1000500" : {
"score2" : "6.8",
"score3" : "8.1",
"cpe_version" : "N/A"
},
"CVE-2018-1000517" : {
"score2" : "7.5",
"score3" : "9.8",
"cpe_version" : "N/A"
}
}}
With the following workaround I was able to directly access documents by their keys, even though they have a dot in their key:
db.getCollection('mycollection').aggregate([
{$match: {mymapfield: {$type: "object" }}}, //filter objects with right field type
{$project: {mymapfield: { $objectToArray: "$mymapfield" }}}, //"unwind" map to array of {k: key, v: value} objects
{$match: {mymapfield: {k: "my.key.with.dot", v: "myvalue"}}} //query
])
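Applied to the cve_results structure from the question, the same $objectToArray idea can pull out the CVE numbers for 'BusyBox 1.12.1'. Here is a minimal sketch with pymongo; the database and collection names are assumptions, only cve_results and the 'BusyBox 1.12.1' key come from the question.
from pymongo import MongoClient

db = MongoClient()["mydb"]  # assumed database name

pipeline = [
    # Turn the cve_results map into an array of {k, v} pairs so the dotted
    # key "BusyBox 1.12.1" becomes a value we can match on.
    {"$project": {"cve_results": {"$objectToArray": "$cve_results"}}},
    {"$unwind": "$cve_results"},
    {"$match": {"cve_results.k": "BusyBox 1.12.1"}},
    # Unpack the per-product CVE map itself and keep only the CVE ids.
    {"$project": {"cves": {"$objectToArray": "$cve_results.v"}}},
    {"$project": {"cve_numbers": "$cves.k"}},
]

for doc in db.mycollection.aggregate(pipeline):  # assumed collection name
    print(doc["cve_numbers"])  # e.g. ['CVE-2018-1000500', 'CVE-2018-1000517']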
If possible, it could be worth inserting documents using \u002e instead of the dot; that way you can query them while retaining the ASCII value of the . for any client rendering.
However, it appears there's a workaround to query them like so:
db.collection.aggregate({
$match: {
"BusyBox 1.12.1" : "<value>"
}
})
You should be able to use the $eq operator to query fields with dots in their names.

How to find the day of the week from timestamp

I have a timestamp, 2015-11-01 21:45:25,296, as mentioned above. Is it possible to extract the day of the week (Mon, Tue, etc.) using any regular expression or grok pattern?
Thanks in advance.
This is quite easy if you want to use the Ruby filter. I am lazy, so that is all I am doing here.
Here is my filter:
filter {
  ruby {
    code => "
      p = Time.parse(event['message']);
      event['day-of-week'] = p.strftime('%A');
    "
  }
}
The 'message' variable is the field that contains your timestamp
With stdin and stdout and your string, you get:
artur#pandaadb:~/dev/logstash$ ./logstash-2.3.2/bin/logstash -f conf2/
Settings: Default pipeline workers: 8
Pipeline main started
2015-11-01 21:45:25,296
{
"message" => "2015-11-01 21:45:25,296",
"#version" => "1",
"#timestamp" => "2016-08-03T13:07:31.377Z",
"host" => "pandaadb",
"day-of-week" => "Sunday"
}
Hope that is what you need,
Artur
What you want is this. Assuming your string is 2015-11-01 21:45:25,296:
mydate='2015-11-01 21:45:25,296'
date +%a -d "${mydate% *}"
will give you what you want.
Short answer is no, you can't.
A regex, according to Wikipedia:
...is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace"-like operations.
So, a regex allows you to parse a String and search for information within it, but it doesn't perform calculations on it.
If you want to make such calculations, you need help from a programming language (Java, C#, Ruby [like @pandaadb suggested], etc.) or some other tool that makes those calculations (e.g. Epoch Converter).
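For example, a short Python sketch of that approach, using the timestamp from the question:
from datetime import datetime

# Parse the timestamp, then format just the weekday name.
ts = "2015-11-01 21:45:25,296"
dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f")
print(dt.strftime("%a"))  # Sun
print(dt.strftime("%A"))  # Sunday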

Find match over Array of RegEx in MongoDB Collection

Say I have a collection with these fields:
{
"category" : "ONE",
"data": [
{
"regex": "/^[0-9]{2}$/",
"type" : "TYPE1"
},
{
"regex": "/^[a-z]{3}$/",
"type" : "TYPE2"
}
// etc
]
}
So my input is "abc" so I'd like to obtain the corresponding type (or best match, although initially I'm assuming RegExes are exclusive). Is there any possible way to achieve this with decent performance? (that would be excluding iterating over each item of the RegEx array)
Please note the schema could be re-arranged if possible, as this project is still in the design phase. So alternatives would be welcomed.
Each category can have around 100 - 150 RegExes. I plan to have around 300 categories.
But I do know that types are mutually exclusive.
Real world example for one category:
type1=^34[0-9]{4}$,
type2=^54[0-9]{4}$,
type3=^39[0-9]{4}$,
type4=^1[5-9]{2}$,
type5=^2[4-9]{2,3}$
Describing the RegEx (Divide et Impera) would greatly help in limiting the number of Documents needed to be processed.
Some ideas in this direction:
RegEx accepting length (fixed, min, max)
POSIX style character classes ([:alpha:], [:digit:], [:alnum:], etc.)
Tree like Document structure (umm)
Implementing each of these would add to the complexity (code and/or manual input) for Insertion and also some overhead for describing the searchterm before the query.
Having mutually exclusive types in a category simplifies things, but what about between categories?
300 categories # 100-150 RegExps/category => 30k to 45k RegExps
... some would surely be exact duplicates if not most of them.
In this approach I'll try to minimise the total number of Documents to be stored/queried in a reversed style vs. your initial proposed 'schema'.
Note: included only string lengths in this demo for narrowing, this may come naturally for manual input as it could reinforce a visual check over the RegEx
Consider rewriting the regexes Collection with Documents as follows:
{
"max_length": NumberLong(2),
"min_length": NumberLong(2),
"regex": "^[0-9][2]$",
"types": [
"ONE/TYPE1",
"NINE/TYPE6"
]
},
{
"max_length": NumberLong(4),
"min_length": NumberLong(3),
"regex": "^2[4-9][2,3]$",
"types": [
"ONE/TYPE5",
"TWO/TYPE2",
"SIX/TYPE8"
]
},
{
"max_length": NumberLong(6),
"min_length": NumberLong(6),
"regex": "^39[0-9][4]$",
"types": [
"ONE/TYPE3",
"SIX/TYPE2"
]
},
{
"max_length": NumberLong(3),
"min_length": NumberLong(3),
"regex": "^[a-z][3]$",
"types": [
"ONE/TYPE2"
]
}
.. each unique RegEx as its own Document, listing the Category/Type combinations it belongs to (extensible to multiple types per category).
Demo Aggregation code:
function () {
    match = null;
    query = 'abc';
    db.regexes.aggregate(
        {$match: {
            max_length: {$gte: query.length},
            min_length: {$lte: query.length},
            types: /^ONE\//
        }},
        {$project: {
            regex: 1,
            types: 1,
            _id: 0
        }}
    ).result.some(function(re){
        if (query.match(new RegExp(re.regex))) return match = re.types;
    });
    return match;
}
Return for 'abc' query:
[
"ONE/TYPE2"
]
this will run against only these two Documents:
{
"regex": "^2[4-9][2,3]$",
"types": [
"ONE/TYPE5",
"TWO/TYPE2",
"SIX/TYPE8"
]
},
{
"regex": "^[a-z][3]$",
"types": [
"ONE/TYPE2"
]
}
narrowed by the length 3 and having the category ONE.
Could be narrowed even further by implementing POSIX descriptors (easy to test against the searchterm but have to input 2 RegExps in the DB)
Breadth first search.
If your input starts with a letter, you can throw away type 1; if it also contains a number, you can throw away the exclusive (numbers-only or letters-only) categories; and if it also contains a symbol, keep only the handful of types that allow all three. Then follow the above advice for the remaining categories. In a sense, set up cases for the input types and use them to narrow the 'regex types' down to the right one.
Or you can create a regex model based on the input and compare it to the list of existing regex models (stored as strings) to get the type. That way you only spend resources analyzing the input to build the regex model for it.
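A rough Python sketch of that last idea; the feature names and the in-memory candidates list are purely illustrative (in practice the model would be stored alongside each regex Document, as in the reworked schema above):
import re

def describe(term):
    # Cheap "model" of the input: length plus which character classes it contains.
    return {
        "length": len(term),
        "has_alpha": any(c.isalpha() for c in term),
        "has_digit": any(c.isdigit() for c in term),
    }

candidates = [
    {"regex": r"^34[0-9]{4}$", "type": "type1",
     "model": {"length": 6, "has_alpha": False, "has_digit": True}},
    {"regex": r"^[a-z]{3}$", "type": "TYPE2",
     "model": {"length": 3, "has_alpha": True, "has_digit": False}},
]

def match_type(term):
    profile = describe(term)
    for cand in candidates:
        if cand["model"] != profile:   # prune without running the regex
            continue
        if re.fullmatch(cand["regex"], term):
            return cand["type"]
    return None

print(match_type("abc"))     # TYPE2
print(match_type("341234"))  # type1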

Regex to calculate straight poker hand?

Is there a regex to calculate a straight poker hand?
I'm using strings to represent the sorted cards, like:
AAAAK#sssss = 4 aces and a king, all of spades.
A2345#ddddd = straight flush, all of diamonds.
In Java, I'm using these regexes:
regexPair = Pattern.compile(".*(\\w)\\1.*#.*");
regexTwoPair = Pattern.compile(".*(\\w)\\1.*(\\w)\\2.*#.*");
regexThree = Pattern.compile(".*(\\w)\\1\\1.*#.*");
regexFour = Pattern.compile(".*(\\w)\\1{3}.*#.*");
regexFullHouse = Pattern.compile("((\\w)\\2\\2(\\w)\\3|(\\w)\\4(\\w)\\5\\5)#.*");
regexFlush = Pattern.compile(".*#(\\w)\\1{4}");
How can I detect straight (sequential) values with a regex?
EDIT
I opened another question to solve the same problem using the ASCII value of each char, so the regex can be shorter. Details here.
Thanks!
I have to admit that regular expressions are not the first tool I would have thought of for doing this. I can pretty much guarantee that any RE capable of doing that to an unsorted hand is going to be far more hideous and far less readable than the equivalent procedural code.
Assuming the cards are sorted by face value (and they seem to be, otherwise your listed regexes wouldn't work either), and that you must use a regex, you could use a construct like
2345A|23456|34567|...|9TJQK|TJQKA
to detect the face value part of the hand.
In fact, from what I gather here of the "standard" hands, the following should be checked in order of decreasing priority:
Royal/straight flush: "(2345A|23456|34567|...|9TJQK|TJQKA)#(\\w)\\1{4}"
Four of a kind: ".*(\\w)\\1{3}.*#.*"
Full house: "((\\w)\\2\\2(\\w)\\3|(\\w)\\4(\\w)\\5\\5)#.*"
Flush: ".*#(\\w)\\1{4}"
Straight: "(2345A|23456|34567|...|9TJQK|TJQKA)#.*"
Three of a kind: ".*(\\w)\\1\\1.*#.*"
Two pair: ".*(\\w)\\1.*(\\w)\\2.*#.*"
One pair: ".*(\\w)\\1.*#.*"
High card: (none)
Basically, those are the same as yours except I've added the royal/straight flush and the straight. Provided you check them in order, you should get the best score from the hand. There's no regex for the high card since, at that point, it's the only score you can have.
I also changed the steel wheel (wrap-around) straights from A2345 to 2345A since they'll be sorted that way.
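For illustration, here is a minimal Python sketch that runs these checks in the order above. The hand format is the one from the question (faces, '#', then suits), hands are assumed pre-sorted by face value with T for ten, and the '...' in the straight list is spelled out here.
import re

STRAIGHTS = ("2345A|23456|34567|45678|56789|"
             "6789T|789TJ|89TJQ|9TJQK|TJQKA")  # expansion of the elided list

CHECKS = [
    ("Royal/straight flush", rf"({STRAIGHTS})#(.)\2{{4}}"),
    ("Four of a kind",       r".*(.)\1{3}.*#.*"),
    ("Full house",           r"((.)\2\2(.)\3|(.)\4(.)\5\5)#.*"),
    ("Flush",                r".*#(.)\1{4}"),
    ("Straight",             rf"({STRAIGHTS})#.*"),
    ("Three of a kind",      r".*(.)\1\1.*#.*"),
    ("Two pair",             r".*(.)\1.*(.)\2.*#.*"),
    ("One pair",             r".*(.)\1.*#.*"),
]

def score(hand):
    # The first pattern that fully matches wins, so priority order matters.
    for name, pattern in CHECKS:
        if re.fullmatch(pattern, hand):
            return name
    return "High card"

print(score("AAAAK#sssss"))  # Four of a kind
print(score("2345A#ddddd"))  # Royal/straight flush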
I rewrote the regexes for this because I found them frustrating and confusing. Groupings make much more sense for this type of logic. The sorting is done using the standard array sort method in JavaScript, hence the strange order of the cards: they are in alphabetical order. I did mine in JavaScript, but the regexes could be applied to Java.
hands = [
{ regex: /(2345A|23456|34567|45678|56789|6789T|789JT|89JQT|9JKQT|AJKQT)#(.)\2{4}.*/g , name: 'Straight flush' },
{ regex: /(.)\1{3}.*#.*/g , name: 'Four of a kind' },
{ regex: /((.)\2{2}(.)\3{1}#.*|(.)\4{1}(.)\5{2}#.*)/g , name: 'Full house' },
{ regex: /.*#(.)\1{4}.*/g , name: 'Flush' },
{ regex: /(2345A|23456|34567|45678|56789|6789T|789JT|89JQT|9JKQT|AJKQT)#.*/g , name: 'Straight' },
{ regex: /(.)\1{2}.*#.*/g , name: 'Three of a kind' },
{ regex: /(.)\1{1}.*(.)\2{1}.*#.*/g , name: 'Two pair' },
{ regex: /(.)\1{1}.*#.*/g , name: 'One pair' },
];