Fluentd Parsing - regex

Hi, I'm trying to parse a single-line log using fluentd. Here is the log I'm trying to parse:
F2:4200000000000000,F3:000000,F4:000000060000,F6:000000000000,F7:000000000,F8..........etc
It should be parsed into something like this:
{ "F2" : "4200000000000000", "F3" : "000000", "F4" : "000000060000" ............etc }
I tried to use regex, but it's confusing and forces me to write multiple regexes for the different keys and values. Is there an easier way to achieve this?
EDIT1: I will make this more detailed. I'm currently tailing logs with fluentd and sending them to Elasticsearch + Kibana. Here is an unparsed example log that fluentd sends to Elasticsearch:
21/09/02 16:36:09.927238: 1 frSMS:0:13995:#HTF4J::141:141:msg0210,00000000000000000,000000,000000,007232,00,#,F2:00000000000000000,F3:002000,F4:000000820000,F6:Random message and strings,F7:.......etc
Message received by Elasticsearch:
{"message":"frSMS:0:13995:#HTF4J::141:141:msg0210,00000000000000000,000000,000000,007232,00,#,F2:00000000000000000,F3:002000,F4:000000820000,F6:Random digits and chars,F7:.......etc"}
This log has only a message key, so I can't index it or build a dashboard using only the whole message field. What I'm trying to achieve is to capture only the useful fields, add a key to each value that has none, and make indexing easier.
Expected output:
{"logdate" : "21/09/02 16:36:09.927238",
"source" : "frSMS",
"UID" : "#HTF4J",
"statuscode" : "msg0210",
"F2": "00000000000000000",
"F3": "randomchar314516",.....}
I used the regex plugin to parse it into this, but it was too overwhelming. Here is what I did so far:
^(?<logDate>\d{2}.\d{2}.\d{2}\s\d{2}:\d{2}:\d{2}.\d{6}\b)....(?<source>fr[A-Z]{3,4}|to[A-Z]{3,4}\b).(?<status>\d\b).(?<dummyfield>\d{5}\b).(?<HUID>.[A-Z]{5}\b)..(?<d1>\d{3}\b).(?<d2>\d{3}\b).(?<msgcode>msg\d{4}\b).(?<dummyfield1>\d{16}\b).(?<dummyfield2>\d{6}\b).(?<dummyfield3>\d{6,7}\b).(?<dummyfield4>\d{6}\b).(?<dummyfield5>\d{2}\b)...
Which results in:
"logDate": "21/09/02 16:36:09.205706",
"source": "toSMS" ,
"status": "0",
"dummyfield": "13995" ,
"UID" : "#HTFAA" ,
"d1" : "156" ,
"d2" : "156" ,
"msgcode" : "msg0210",
"dummyfield1" :"0000000000000000" ,
"dummyfield2" :"002000",
"dummyfield3" :"2000000",
"dummyfield4" :"00",
"dummyfield5" :"2000000" ,
"dummyfield6" :"867202"
This only applies to the example log and produces useless fields like field1, dummyfield, dummyfield1, etc.
Other logs have the useful keys and values (date, source, msgcode, UID, F1, F2 fields) that I showed in the expected output. The non-useful fields are not static (they can be missing, or have fewer or more digits and characters), so they trigger the "pattern not matched" error.
So the questions are:
How do I capture the useful fields I mentioned using regex?
How do I capture the F1, F2, F3, ... fields, whose values follow different patterns, such as mixed characters and strings?
PS: I wrapped the regex I wrote in an HTML snippet so that the <> capture-group names don't get deleted.

Regex pattern to use:
(F[\d]+):([\d]+)
This pattern matches every 'F' key, whatever digits follow it, so even F105 works. The whole key (e.g. 'F105') is stored as the first group of the match.
The right-hand part of the pattern matches the digits following ':' up to the first character that is not a digit (',', 'F', etc.) and stores them as the second group of the match.
Use
Depending on your programming language, you will have to iterate over the regex matches and extract group 1 and group 2 respectively.
Python example:
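import re

log = 'F2:4200000000000000,F3:000000,F4:000000060000,F6:000000000000,F7:000000000,F105:9726450'
# Raw string so the backslashes reach the regex engine unchanged.
# Group 1 is the key ("F" plus digits), group 2 the digits after the colon.
pattern = r'(F[\d]+):([\d]+)'
matches = re.finditer(pattern, log)
log_dict = {}
for match in matches:
    log_dict[match.group(1)] = match.group(2)
print(log_dict)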
Output
{'F2': '4200000000000000', 'F3': '000000', 'F4': '000000060000', 'F6': '000000000000', 'F7': '000000000', 'F105': '9726450'}
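The pattern above only captures values made of digits, while the second sample in the question contains values such as "F6:Random message and strings". A minimal sketch of one way around that, assuming a value never contains a comma, is to widen the value group to "everything up to the next comma". The name parse_f_fields is made up for this illustration.

```python
import re

# Key: "F" followed by digits, at the start of the line or right after a comma.
# Value: everything up to the next comma (assumption: values contain no commas).
F_FIELD = re.compile(r'(?:^|,)(F\d+):([^,]*)')

def parse_f_fields(log):
    """Return a dict with every F<number> key found in one log line."""
    return {key: value for key, value in F_FIELD.findall(log)}

sample = 'F2:00000000000000000,F3:002000,F4:000000820000,F6:Random message and strings'
print(parse_f_fields(sample))
```

Anchoring the key on the start of the line or a comma keeps an "F" followed by digits inside some other value from being mistaken for a key.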

Assuming the logdate pattern is static, you can skip the useless values with ".+" and collect the useful values by their patterns. The regex will look like this:
(?<logdate>\d{2}.\d{2}.\d{2}\s\d{2}:\d{2}:\d{2}.\d{6}\b).+(?<source>fr[A-Z]{3,4}|to[A-Z]{3,4}).+(?<UID>#[A-Z0-9]{5}).+(?<statuscode>msg\d{4})
And output will be like:
{"logdate" : "21/09/02 16:36:09.927238", "source" : "frSMS",
"UID" : "#HTF4J","statuscode" : "msg0210"}
And I'm working on getting F2,F3,FN keys and values.
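For checking the idea outside fluentd, here is a rough Python sketch that combines the header fields with the F keys in a single record. It is only an illustration: Python spells named groups (?P<name>...), the gaps use lazy ".+?" rather than ".+", the F values are assumed to contain no commas, and parse_line is a hypothetical helper name.

```python
import re

# Header fields, separated by lazy gaps that skip the values we do not need.
HEADER = re.compile(
    r'(?P<logdate>\d{2}/\d{2}/\d{2}\s\d{2}:\d{2}:\d{2}\.\d{6})'
    r'.+?(?P<source>(?:fr|to)[A-Z]{3,4})'
    r'.+?(?P<UID>#[A-Z0-9]{5})'
    r'.+?(?P<statuscode>msg\d{4})'
)
# Assumption: an F value never contains a comma.
F_FIELD = re.compile(r',(F\d+):([^,]*)')

def parse_line(line):
    """Return header fields plus F<number> fields, or None if the header is missing."""
    header = HEADER.search(line)
    if header is None:
        return None
    record = header.groupdict()
    # Only look for F keys in the part of the line that follows the header.
    record.update(F_FIELD.findall(line[header.end():]))
    return record

line = ('21/09/02 16:36:09.927238: 1 frSMS:0:13995:#HTF4J::141:141:msg0210,'
        '00000000000000000,000000,000000,007232,00,#,'
        'F2:00000000000000000,F3:002000,F4:000000820000,F6:Random message and strings')
print(parse_line(line))
```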

Related

MongoDB query with special characters in key

In my case, I have keys in my MongoDB database that contain a dot in their name (see attached screenshot). I have read that it is possible to store data in MongoDB this way, but the driver prevents queries with dots in the key. Anyway, in my MongoDB database, keys do contain dots and I have to work with them.
I have now tried to encode the dots in the query (. to \u002e) but it did not seem to work. Then I had the idea to work with regex to replace the dots in the query with any character but regex seems to only work for the value and not for the key.
Does anyone have a creative idea how I can get around this problem? For example, I want to have all the CVE numbers for 'cve_results.BusyBox 1.12.1'.
Update #1:
The structure of cve_results is as follows:
"cve_results" : {
"BusyBox 1.12.1" : {
"CVE-2018-1000500" : {
"score2" : "6.8",
"score3" : "8.1",
"cpe_version" : "N/A"
},
"CVE-2018-1000517" : {
"score2" : "7.5",
"score3" : "9.8",
"cpe_version" : "N/A"
}
}}
With the following workaround I was able to directly access documents by their keys, even though they have a dot in their key:
db.getCollection('mycollection').aggregate([
{$match: {mymapfield: {$type: "object" }}}, //filter objects with right field type
{$project: {mymapfield: { $objectToArray: "$mymapfield" }}}, //"unwind" map to array of {k: key, v: value} objects
{$match: {mymapfield: {k: "my.key.with.dot", v: "myvalue"}}} //query
])
If possible, it could be worth inserting documents using \u002e instead of the dot; that way you can query them while retaining the ASCII value of the . for any client rendering.
However, it appears there's a workaround to query them like so:
db.collection.aggregate({
$match: {
"BusyBox 1.12.1" : "<value>"
}
})
You should be able to use $eq operator to query fields with dots in names.

How to generate regex patterns in python using re.compile

I am trying to create a python code that will be able to extract the information from strings such as the one below, using regular expressions.
date=2019-10-26 time=17:59:00 logid="0000000020" type="traffic" subtype="forward" level="notice" vd="root" eventtime=1572127141 srcip=192.168.6.15 srcname="TR" srcport=522 srcintf="port1" srcintfrole="lan" dstip=172.217.15.194 dstport=43 dstintf="wan2" dstintfrole="wan" poluuid="feb1fa32-d08b-51e7-071f-19e3b5d2213c" sessionid=195421734 proto=6 action="accept" policyid=4 policytype="policy" service="HTTPS" dstcountry="United States" srccountry="Reserved" trandisp="snat" transip=168.168.140.247 transport=294 appid=537 app="Google.Ads" appcat="General.Interest" apprisk="elevated" applist="Seniors" appact="detected" duration=719 sentbyte=2691 rcvdbyte=2856 sentpkt=19 rcvdpkt=25 shapingpolicyid=1 sentdelta=449 rcvddelta=460 devtype="Linux" devcategory="Linux" mastersrcmac="fa:cc:4e:a3:56:2d" srcmac="fa:cc:4e:a3:56:2d" srcserver=0
I found someone's code on github and he uses the lines below to extract the information, however, his code doesn't extract all of the fields I require, most notably srcip=192.168.1.105
I don't want to post the guy's entire code as it's not mine. However, if it is required I can.
I am hoping all the fields will be extracted from the jumble of information so I can save them as a .csv file.
The regex \w+=([^\s"]+|"[^"]*") matches
The field name (at least one word character), then
An = sign, then
Either:
An unquoted field value (at least one character, excluding whitespace and quotes), or
A quoted field value (", then any number of non-quotes, then ").
By adding parentheses around the parts of the regex which match field name, and the unquoted and quoted values, we can extract the relevant parts and put them into a dictionary using a comprehension, using the findall method:
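import re

pattern = re.compile(r'(\w+)=(([^\s"]+)|"([^"]*)")')

def parse_fields(text):
    # findall yields (name, whole value, unquoted value, quoted value) tuples;
    # exactly one of the last two is non-empty for each field.
    return {
        name: (value or quoted_value)
        for name, _, value, quoted_value in pattern.findall(text)
    }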
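Since the goal in the question is a .csv file, the parsed dictionaries could be written out with the standard csv module. The sketch below repeats parse_fields so that it runs on its own. The to_csv helper and the shortened sample lines are made up for illustration, and it writes to an in-memory string rather than a real file.

```python
import csv
import io
import re

pattern = re.compile(r'(\w+)=(([^\s"]+)|"([^"]*)")')

def parse_fields(text):
    return {
        name: (value or quoted_value)
        for name, _, value, quoted_value in pattern.findall(text)
    }

def to_csv(lines):
    """Parse each log line and return CSV text with one column per field name."""
    rows = [parse_fields(line) for line in lines]
    # Collect column names in first-seen order, since lines may carry different fields.
    columns = list(dict.fromkeys(name for row in rows for name in row))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval='', lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

lines = [
    'date=2019-10-26 srcip=192.168.6.15 srcname="TR" dstcountry="United States"',
    'date=2019-10-27 srcip=192.168.6.16 action="accept"',
]
print(to_csv(lines))
```

Fields missing from a line are left empty in that row, so every line ends up with the same columns.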
Same as kaya3, but I don't keep the quotes
s = '''date=2019-10-26 time=17:59:00 logid="0000000020" type="traffic"
subtype="forward" level="notice" vd="root" eventtime=1572127141
srcip=192.168.6.15 srcname="TR" srcport=522 srcintf="port1" srcintfrole="lan"
dstip=172.217.15.194 dstport=43 dstintf="wan2" dstintfrole="wan"
poluuid="feb1fa32-d08b-51e7-071f-19e3b5d2213c" sessionid=195421734 proto=6
action="accept" policyid=4 policytype="policy" service="HTTPS"
dstcountry="United States" srccountry="Reserved" trandisp="snat"
transip=168.168.140.247 transport=294 appid=537 app="Google.Ads"
appcat="General.Interest" apprisk="elevated" applist="Seniors"
appact="detected" duration=719 sentbyte=2691 rcvdbyte=2856 sentpkt=19
rcvdpkt=25 shapingpolicyid=1 sentdelta=449 rcvddelta=460 devtype="Linux"
devcategory="Linux" mastersrcmac="fa:cc:4e:a3:56:2d" srcmac="fa:cc:4e:a3:56:2d"
srcserver=0'''
import re

matches = re.findall(r'([a-zA-Z_][a-zA-Z0-9_]*)=(?:"([^"]+)"|(\S+))', s)
d = {
    name: quoted or unquoted
    for name, quoted, unquoted in matches
}

Regular expression match U-SQL

I need help writing U-SQL that outputs records to two different files based on the result of a regular expression.
Let me explain my scenario in detail.
Let us assume my input file has two columns, "Name" and person identification number ("PIN"):
Name , PIN
John ,12345
Harry ,01234
Tom, 24659
My condition for PIN is it should start with either 1 or 2. In the above case records 1 & 3 are valid and record 2 is invalid.
I need to output record 1 & 3 to my output processed file and 2 to my error file
How can I do this and also can I use Regex.Match to validate the regular expression?
//posting my code
@person =
EXTRACT UserId int,
PNR string,
UID String,
FROM "/Samples/Data/person.csv"
USING Extractors.csv();
@rs1=select UserId,PNR,UID,Regex.match(PNR,'^(19|20)[0-9]{2}((0[1-9])$') as pnrval,Regex.match(UID,'^(19|20)[0-9]{2}$') as uidval
from @person
@rs2 = select UserId,PNR,UID from @rs1 where pnrval=true or uidval=true
@rs3 = select UserId,PNR,UID from @rs1 where uidval=false or uidval= false
OUTPUT @rs2
TO "/output/sl.csv"
USING Outputters.Csv();
OUTPUT @rs3
TO "/output/error.csv"
USING Outputters.Csv();
But I'm receiving this error:
Severity Code Description Project File Line Suppression State Error
E_CSC_USER_INVALIDCOLUMNTYPE: 'System.Text.RegularExpressions.Match'
cannot be used as column type.
@someData =
    SELECT * FROM
        ( VALUES
        ("John", "12345"),
        ("Harry", "01234"),
        ("Tom", "24659")
        ) AS T(Name, pin);
@result1 =
    SELECT Name,
           pin
    FROM @someData
    WHERE pin.StartsWith("1") OR pin.StartsWith("2");
@result2 =
    SELECT Name,
           pin
    FROM @someData
    WHERE !pin.StartsWith("1") AND !pin.StartsWith("2");
@person =
    EXTRACT UserId int,
            PNR string,
            UID string
    FROM "/Samples/Data/person.csv"
    USING Extractors.Csv();
// Regex.IsMatch returns a bool, which is a valid column type (Regex.Match returns a Match object, which is not).
@rs1 =
    SELECT UserId, PNR, UID,
           Regex.IsMatch(PNR, "^(19|20)[0-9]{2}(0[1-9])$") AS pnrval,
           Regex.IsMatch(UID, "^(19|20)[0-9]{2}$") AS uidval
    FROM @person;
@rs2 = SELECT UserId, PNR, UID FROM @rs1 WHERE pnrval == true OR uidval == true;
@rs3 = SELECT UserId, PNR, UID FROM @rs1 WHERE pnrval == false OR uidval == false;
OUTPUT @rs2
TO "/output/sl.csv"
USING Outputters.Csv();
OUTPUT @rs3
TO "/output/error.csv"
USING Outputters.Csv();
This worked for my requirement. Thanks for the support and suggestions
Considering your input, I would use
.*\s*,\s*[12]\d+
.* matches any amount of characters and is needed to match everything before the comma
\s*,\s* matches a comma optionally preceded and/or followed by any amount of whitespace (\s matches a single whitespace character)
[12] matches a single digit, equal to 1 or 2; this satisfies your requirement about PINs
\d+ matches one or more digits
Live demo here.
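To make the rule concrete outside U-SQL, here is a small Python sketch of the same split, applied to the PIN column after the line has been parsed as CSV. It assumes the input looks exactly like the sample in the question (a header row, then name and PIN). The split_records helper is a made-up name, and it returns two lists instead of writing the two output files.

```python
import csv
import io
import re

# A PIN is valid when it starts with 1 or 2 and continues with digits.
VALID_PIN = re.compile(r'[12]\d+')

def split_records(csv_text):
    """Return (valid, invalid) lists of (name, pin) tuples; the header row is skipped."""
    valid, invalid = [], []
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the "Name , PIN" header
    for row in reader:
        if len(row) != 2:
            continue  # ignore blank or malformed lines in this sketch
        name, pin = (cell.strip() for cell in row)
        target = valid if VALID_PIN.fullmatch(pin) else invalid
        target.append((name, pin))
    return valid, invalid

data = "Name , PIN\nJohn ,12345\nHarry ,01234\nTom, 24659\n"
valid, invalid = split_records(data)
print(valid)
print(invalid)
```

Stripping the cells first means the stray blanks around the commas in the sample do not affect the check.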
As far as using Regex.Match, I'll quote this answer on StackOverflow:
System.Text.RegularExpressions.Match is not part of the built-in U-SQL types.
So what I would do here is pre-parsing your CSV in C#; something like:
Regex CurrentRegex = new Regex(@".*\s*,\s*[12]\d+", RegexOptions.IgnoreCase);
foreach (var LineOfText in File.ReadAllLines(InputFilePath))
{
    Match CurrentMatch = CurrentRegex.Match(LineOfText);
    if (CurrentMatch.Success)
    {
        // Append line to success file
    }
    else
    {
        // Append line to error file
    }
    CurrentMatch = CurrentMatch.NextMatch();
}

MongoDB aggregation pipeline - matching $regex pattern of _id and setting as value of another field

Prior to the projection stage of my pipeline, my document ids are of the type:
"manchester_10:2016-09-28 09"
"burnley_10:2016-09-28 09"
In the projection stage, I'm trying to project new fields 'location' and 'date_hour', but I'm having problems regex-matching the _id value and effectively splitting it at the : into two separate key-value pairs such as this:
location: "manchester_10"
date_hour: "2016-09-28 09"
So far I've tried a number of things, with my $project query as this:
'location': { '$_id': { $regex: /.+?(?=:)/}},
The regex pattern correctly matches anything up to the colon, but I'm stumped as to how exactly to write the query. Any ideas?
Try this, assuming a single-line string:
/([^:]*):(\d{4}-\d{2}-\d{2} \d{2})/g
Here group 1 is the location and group 2 is the date and hour.
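The answer does not say where this regex would run. As a hedged illustration, the sketch below applies the same two groups on the client side in Python, for example to documents after they have been fetched; it does not show how to do the split inside $project. The split_id helper name is made up. If the server version supports it, the $split aggregation operator may be another route worth checking.

```python
import re

# Group 1: everything before the first colon; group 2: date plus two-digit hour.
ID_PATTERN = re.compile(r'([^:]*):(\d{4}-\d{2}-\d{2} \d{2})')

def split_id(doc_id):
    """Split an _id such as 'manchester_10:2016-09-28 09' into two named fields."""
    match = ID_PATTERN.fullmatch(doc_id)
    if match is None:
        return None
    return {'location': match.group(1), 'date_hour': match.group(2)}

print(split_id('manchester_10:2016-09-28 09'))
print(split_id('burnley_10:2016-09-28 09'))
```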

Scala Splitting with Regex removing the fields instead of splitting

I am trying to parse some logs that look like this:
2016-05-16 04:15:16,842 INFO org.apache.hadoop.hive.ql.log.PerfLogger: [pool-3-thread-194]: <PERFLOG method=get_database from=org.apache.hadoop.hive.metastore.RetryingHMSHandler>
2016-05-16 04:15:16,842 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: [pool-3-thread-194]: 154: get_database: newcluster
I am just experimenting with how to split this file, and every attempt I've made so far looks like this:
val split = hive.map(x=>x.split(":?(\\d{4})")).take(1)
This removes the first 4 digits instead of splitting at the 4 digits:
split: Array[Array[String]] = Array(Array("", -05-13 00:37:50,808 INFO org.apache.hadoop.hive.ql.log.PerfLogger: [pool-3-thread-194]: </PERFLOG method=drop_table_with_environment_context start=, "", "", 5 end=, "", "", 8 duration=73 from=org.apache.hadoop.hive.metastore.RetryingHMSHandler threadId=154 retryCount=0 error=false>))
Why is it removing the field? I have a more complex Regex I've built but just removes everything...
So make the argument to split instead "[- :,]"? What does that give you?
Here is how to create a regex that extracts the date pattern:
val pattern = """\d{4}-\d{2}-\d{2}""".r
val log = """2016-05-16 04:15:16,842 INFO org.apache.hadoop.hive.ql.log.PerfLogger: [pool-3-thread-194]: <PERFLOG method=get_database from=org.apache.h"""
pattern.findFirstIn(log)
res: Option[String] = Some(2016-05-16)
Following this pattern should help you parse any elements from the log you need.
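The behaviour in the question is not specific to Scala: splitting treats whatever the regex matches as the delimiter and throws it away, which is why the four digits disappear. The sketch below reproduces that in Python and then matches the fields instead of splitting. The group names and the parse_log_line helper are assumptions for illustration, based only on the two sample lines in the question.

```python
import re

# Splitting ON the digits consumes them: the year is gone from the result.
print(re.split(r'\d{4}', '2016-05-16 04:15'))

# Matching the fields keeps every piece and gives each one a name.
LINE = re.compile(
    r'(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) '
    r'(?P<level>\w+) '
    r'(?P<logger>[\w.]+): '
    r'(?P<message>.*)'
)

def parse_log_line(line):
    """Return the named parts of one log line, or None if it does not match."""
    match = LINE.match(line)
    return match.groupdict() if match else None

line = ('2016-05-16 04:15:16,842 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: '
        '[pool-3-thread-194]: 154: get_database: newcluster')
print(parse_log_line(line))
```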
See here for more examples.