Scala: splitting with a regex removes the fields instead of splitting - regex

I am trying to parse some logs that look like this:
2016-05-16 04:15:16,842 INFO org.apache.hadoop.hive.ql.log.PerfLogger: [pool-3-thread-194]: <PERFLOG method=get_database from=org.apache.hadoop.hive.metastore.RetryingHMSHandler>
2016-05-16 04:15:16,842 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: [pool-3-thread-194]: 154: get_database: newcluster
I am just experimenting with how to split this file, and every attempt I've made so far, like this one:
val split = hive.map(x=>x.split(":?(\\d{4})")).take(1)
removes the first four digits instead of splitting at them:
split: Array[Array[String]] = Array(Array("", -05-13 00:37:50,808 INFO org.apache.hadoop.hive.ql.log.PerfLogger: [pool-3-thread-194]: </PERFLOG method=drop_table_with_environment_context start=, "", "", 5 end=, "", "", 8 duration=73 from=org.apache.hadoop.hive.metastore.RetryingHMSHandler threadId=154 retryCount=0 error=false>))
Why is it removing the field? I have a more complex regex I've built, but it just removes everything...

split treats whatever the regex matches as a delimiter and discards it from the output - that is why the digits vanish. So make the argument to split "[- :,]" instead? What does that give you?
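To see the difference concretely, a quick sketch in Python (Scala's String.split consumes the match the same way, and the zero-width lookahead trick works in Scala/Java regexes too):
import re
line = "2016-05-16 04:15:16,842 INFO org.apache.hadoop.hive.ql.log.PerfLogger:"
# split() consumes whatever the pattern matches, so the digits disappear:
print(re.split(r"\d{4}", line, maxsplit=1))
# ['', '-05-16 04:15:16,842 INFO org.apache.hadoop.hive.ql.log.PerfLogger:']
# A zero-width lookahead splits at the position without consuming any text
# (zero-width splits need Python 3.7+):
print(re.split(r"(?=INFO)", line))
# ['2016-05-16 04:15:16,842 ', 'INFO org.apache.hadoop.hive.ql.log.PerfLogger:']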

Here is how to create a regex that extracts the date pattern:
val pattern = """\d{4}-\d{2}-\d{2}""".r
val log = """2016-05-16 04:15:16,842 INFO org.apache.hadoop.hive.ql.log.PerfLogger: [pool-3-thread-194]: <PERFLOG method=get_database from=org.apache.h"""
pattern.findFirstIn(log)
res: Option[String] = Some(2016-05-16)
Following this pattern should help you parse any elements from the log you need.

Related

Fluentd Parsing

Hi, I'm trying to parse a single-line log using fluentd. Here is the log I'm trying to parse:
F2:4200000000000000,F3:000000,F4:000000060000,F6:000000000000,F7:000000000,F8..........etc
This should be parsed into something like this:
{ "F2" : "4200000000000000", "F3" : "000000", "F4" : "000000060000" ............etc }
I tried to use a regex, but it's confusing and makes me write multiple regexes for different keys and values. Is there an easier way to achieve this?
EDIT1: Heya! I will make this more detailed. I'm currently tailing logs using fluentd into Elasticsearch+Kibana. Here is an unparsed example log that fluentd sends to Elasticsearch:
21/09/02 16:36:09.927238: 1 frSMS:0:13995:#HTF4J::141:141:msg0210,00000000000000000,000000,000000,007232,00,#,F2:00000000000000000,F3:002000,F4:000000820000,F6:Random message and strings,F7:.......etc
Elasticsearch received the message:
{"message":"frSMS:0:13995:#HTF4J::141:141:msg0210,00000000000000000,000000,000000,007232,00,#,F2:00000000000000000,F3:002000,F4:000000820000,F6:Random digits and chars,F7:.......etc"}
This log has only the message key, so I can't index and create dashboards using only the whole message field. What I am trying to achieve is to catch only the useful fields, add a key to any value that has none, and make indexing easier.
Expected output:
{"logdate" : "21/09/02 16:36:09.927238",
"source" : "frSMS",
"UID" : "#HTF4J",
"statuscode" : "msg0210",
"F2": "00000000000000000",
"F3": "randomchar314516",.....}
I used the regex plugin to parse it like this, but it was too overwhelming. Here is what I did so far:
^(?<logDate>\d{2}.\d{2}.\d{2}\s\d{2}:\d{2}:\d{2}.\d{6}\b)....(?<source>fr[A-Z]{3,4}|to[A-Z]{3,4}\b).(?<status>\d\b).(?<dummyfield>\d{5}\b).(?<HUID>.[A-Z]{5}\b)..(?<d1>\d{3}\b).(?<d2>\d{3}\b).(?<msgcode>msg\d{4}\b).(?<dummyfield1>\d{16}\b).(?<dummyfield2>\d{6}\b).(?<dummyfield3>\d{6,7}\b).(?<dummyfield4>\d{6}\b).(?<dummyfield5>\d{2}\b)...
Which results in:
"logDate": "21/09/02 16:36:09.205706",
"source": "toSMS" ,
"status": "0",
"dummyfield": "13995" ,
"UID" : "#HTFAA" ,
"d1" : "156" ,
"d2" : "156" ,
"msgcode" : "msg0210",
"dummyfield1" :"0000000000000000" ,
"dummyfield2" :"002000",
"dummyfield3" :"2000000",
"dummyfield4" :"00",
"dummyfield5" :"2000000" ,
"dummyfield6" :"867202"
This only applies to the example log, and it has useless fields like dummyfield, dummyfield1, etc.
Other logs have the useful values and keys (date, source, msgcode, UID, F1, F2 fields) as I showcased in the expected output. The non-useful fields are not static (they can be absent, or have fewer or more digits and characters), so they trigger a "pattern not matched" error.
So the question is:
How do I capture the useful fields that I mentioned, using regex?
How do I capture the F1, F2, F3... fields, whose values have mixed patterns (characters and digits together)?
PS: I wrapped the regex I wrote in an HTML snippet so the <> capturing-group names don't get deleted.
Regex pattern to use:
(F[\d]+):([\d]+)
This pattern will catch all the 'F' keys with whatever digits come after the 'F' - yes, even if it's F105 it still works. The whole 'F105' will be stored as the first group of your regex match.
The right part of the pattern will catch the value: all the digits following ':', up until any character that is not a digit (i.e. ',', 'F', etc.), and store it as the second group of your regex match.
Use
Depending on your coding language, you will have to iterate over your regex matches and extract group 1 and group 2 respectively.
Python example:
import re
log = 'F2:4200000000000000,F3:000000,F4:000000060000,F6:000000000000,F7:000000000,F105:9726450'
pattern = '(F[\d]+):([\d]+)'
matches = re.finditer(pattern,log)
log_dict = {}
for match in matches:
    log_dict[match.group(1)] = match.group(2)
print(log_dict)
Output
{'F2': '4200000000000000', 'F3': '000000', 'F4': '000000060000', 'F6': '000000000000', 'F7': '000000000', 'F105': '9726450'}
Assuming the logdate is static (pattern-wise), you can skip over the useless values with ".+" and collect the useful values by their patterns. So the regex will be like this:
(?<logdate>\d{2}.\d{2}.\d{2}\s\d{2}:\d{2}:\d{2}.\d{6}\b).+(?<source>fr[A-Z]{3,4}|to[A-Z]{3,4}).+(?<UID>#[A-Z0-9]{5}).+(?<statuscode>msg\d{4})
And output will be like:
{"logdate" : "21/09/02 16:36:09.927238", "source" : "frSMS",
"UID" : "#HTF4J","statuscode" : "msg0210"}
And I'm working on getting F2,F3,FN keys and values.
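Putting the two answers together, here is a rough Python sketch of how the fixed header fields and the F-fields could end up in one record. The group names follow the expected output above; treating everything up to the next comma as the F-value (so fields like F6 with free text still match) is my assumption:
import re
log = ("21/09/02 16:36:09.927238: 1 frSMS:0:13995:#HTF4J::141:141:msg0210,"
       "00000000000000000,000000,000000,007232,00,#,"
       "F2:00000000000000000,F3:002000,F4:000000820000,F6:Random message and strings")
# Fixed fields, adapted from the named-group pattern above (Python spells
# named groups (?P<name>...)).
header = re.search(
    r"(?P<logdate>\d{2}/\d{2}/\d{2}\s\d{2}:\d{2}:\d{2}\.\d{6})"
    r".+?(?P<source>(?:fr|to)[A-Z]{3,4})"
    r".+?(?P<UID>#[A-Z0-9]{5})"
    r".+?(?P<statuscode>msg\d{4})",
    log,
)
record = header.groupdict() if header else {}
# F-fields: assume the value runs up to the next comma, since values such
# as F6 can contain letters and spaces.
record.update(re.findall(r"(F\d+):([^,]*)", log))
print(record)
# {'logdate': '21/09/02 16:36:09.927238', 'source': 'frSMS', 'UID': '#HTF4J',
#  'statuscode': 'msg0210', 'F2': '00000000000000000', 'F3': '002000',
#  'F4': '000000820000', 'F6': 'Random message and strings'}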

How to extract parts of logs based on identification numbers?

I am trying to extract and preprocess log data for a use case.
For instance, the log consists of problem numbers with information for each ID underneath. Each element starts with:
#!#!#identification_number###96245#!#!#change_log###
action
action1
change
#!#!#attribute###value_change
#!#!#attribute1###status_change
#!#!#attribute2###<None>
#!#!#attribute3###status_change_fail
#!#!#attribute4###value_change
#!#!#attribute5###status_change
#!#!#identification_number###96246#!#!#change_log###
action
change
change1
action1
#!#!#attribute###value_change
#!#!#attribute1###status_change_fail
#!#!#attribute2###value_change
#!#!#attribute3###status_change
#!#!#attribute4###value_change
#!#!#attribute5###status_change
I extracted the identification numbers and saved them as a .csv file:
import re

f = open(r'C:\Users\reszi\Desktop\Temp\output_new.txt', encoding="utf8")
change_log = f.read()  # read as one string; re.findall needs a string, not a list of lines

number = re.findall('#!#!#identification_number###(.+?)#!#!#change_log###', change_log)
Now what I am trying to achieve is that, for every ID in the .csv file, I can append the corresponding log content, which is:
action
change
#!#!#attribute###
Since I am rather new to Python and only started working with regex a few days ago, I was hoping for some help.
Each log for an ID starts with "#!#!#identification_number###" and ends with "#!#!#attribute5###<entry>".
I have tried the following code, but the result is empty:
In:
x = re.findall("\[^#!#!#identification_number###((.|\n)*)#!#!#attribute5###((.|\n)*)$]", str(change_log))
In:
print(x)
Out:
[]
Try this:
pattern='entification_number###(.+?)#!#!#change_log###(.*?)#!#!#id'
re.findall(pattern, string+'#!#!#id', re.DOTALL)
The DOTALL flag makes the dot match newlines too, so in the second capturing group you will find the log body. (The pattern starts mid-word on purpose: the appended '#!#!#id' sentinel consumes the '#!#!#id' prefix of the following record, so the next match has to begin at 'entification_number'.)
If you want to get the attributes for each identification number, you can parse the log of each ID (obtained from the search above) with the following:
pattern='#!#!#attribute(.*?)###(.*?)#!#'
re.findall(pattern, string_for_each_log_match+'#!#', re.DOTALL)
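For instance, run against a trimmed version of the sample from the question (a sketch; note I simplified the attribute pattern to plain per-line matching, since the consuming '#!#' terminator in the two-step version can swallow the start of the very next '#!#!#attribute' marker):
import re
string = """#!#!#identification_number###96245#!#!#change_log###
action
action1
change
#!#!#attribute###value_change
#!#!#attribute5###status_change
#!#!#identification_number###96246#!#!#change_log###
action
change
#!#!#attribute###value_change
#!#!#attribute5###status_change
"""
pattern = 'entification_number###(.+?)#!#!#change_log###(.*?)#!#!#id'
records = re.findall(pattern, string + '#!#!#id', re.DOTALL)
for id_no, body in records:
    attrs = re.findall(r'#!#!#attribute(\d*)###(.*)', body)
    print(id_no, attrs)
# 96245 [('', 'value_change'), ('5', 'status_change')]
# 96246 [('', 'value_change'), ('5', 'status_change')]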
If you put each ID into the regex using str.format() when you search, you can grab the lines that contain the correct change log.
import re

with open(r'path\to\csv.csv', 'r') as f:
    ids = [line.strip() for line in f]  # strip newlines so the regex can match

with open(r'C:\Users\reszi\Desktop\Temp\output_new.txt', encoding="utf8") as f:
    change_log = f.readlines()

matches = {}
for id_no in ids:
    for i in range(len(change_log)):
        reg = '#!#!#identification_number###({})#!#!#change_log###'.format(id_no)
        if re.search(reg, change_log[i]):
            matches[id_no] = i
            break
This will create a dictionary with the structure {id_no:line_no,...}.
So once you have all of the lines that tell you where each log starts, you can grab the lines you want that come after these lines.
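For example, continuing the snippet above (a sketch): sort the start lines, then slice change_log between consecutive starts to get each ID's block.
starts = sorted(matches.items(), key=lambda kv: kv[1])
blocks = {}
for (id_no, start), (_, end) in zip(starts, starts[1:] + [('', len(change_log))]):
    blocks[id_no] = change_log[start:end]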

split string using regex python and "re" package

I'm using Python 3 on Windows 10. Consider the following list of strings:
import re
s = ["12345", "67891", "01112"]
I want to split these ZIP codes at the 3rd character to get the zip3, but this code throws an error.
re.split("\d{3}", s)
TypeError: cannot use a string pattern on a bytes-like object
I'm not quite sure how to get around. Help appreciated. Thanks.
To get the first three of each, simply string-slice them:
s = ["12345", "67891", "01112"]
first_three = [p[0:3] for p in s]
print(first_three)
Output:
['123', '678', '011'] # slicing
To split all text in threes, join it, then use chunking to get chunks of 3 letters:
s = ["12345", "67891", "01112"]
k = ''.join(s)
threesome = [k[i:i+3] for i in range(0,len(k),3)]
print(threesome)
Output:
['123', '456', '789', '101', '112'] # join + chunking
See How do you split a list into evenly sized chunks? and Understanding Python's slice notation.
Slicing and chunking work on strings as well; the official documentation covers strings and slicing.
To get the remainder as well:
s = ["12345", "67891", "01112"]
three_and_two = [[p[:3], p[3:]] for p in s]
print(three_and_two) # [['123', '45'], ['678', '91'], ['011', '12']]
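And since the question title asks for re: re.split() expects a single string, not a list, which is why the call in the question fails. If you do want the zip3 with a regex, match each element instead of splitting it (a small sketch):
import re
s = ["12345", "67891", "01112"]
zip3 = [re.match(r"\d{3}", p).group() for p in s]
print(zip3)  # ['123', '678', '011']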

Python .splitlines() to segment text into separate variables

I've read the other threads on this site but haven't quite grasped how to accomplish what I want to do. I'd like to find a method like .splitlines() to assign the first two lines of text in a multiline string to two separate variables, then group the rest of the text in the string together in another variable.
The purpose is to have consistent data-sets to write to a .csv using the three variables as data for separate columns.
Title of a string
Description of the string
There are multiple lines under the second line in the string!
There are multiple lines under the second line in the string!
There are multiple lines under the second line in the string!
Any guidance on the pythonic way to do this would be appreciated.
Using islice
In addition to normal list slicing you can use islice() which is more performant when generating slices of larger lists.
Code would look like this:
from itertools import islice

with open('input.txt') as f:
    data = f.readlines()

first_line_list = list(islice(data, 0, 1))
second_line_list = list(islice(data, 1, 2))
other_lines_list = list(islice(data, 2, None))

first_line_string = "".join(first_line_list)
second_line_string = "".join(second_line_list)
other_lines_string = "".join(other_lines_list)
However, you should keep in mind that the data source you read from must be long enough. If it is not, neither islice() nor normal slice notation raises an error - you simply get back fewer (or zero) items - so it is worth checking that you actually got the lines you expect.
Using regex
The OP additionally asked in the comments below for a list-free approach.
Since reading data from a file yields a string (and, via string handling, lists later on) or directly a list of lines, I suggested using a regex instead.
I cannot say anything about the performance comparison between list/string handling and regex operations. However, this should do the job:
import re

# First two groups each match a single line; the last group spans the rest.
regex = r'(?P<first>[^\n]+)\n(?P<second>[^\n]+)\n(?P<rest>[\s\S]+)'
preg = re.compile(regex)

with open('input.txt') as f:
    data = f.read()

match = preg.search(data)
first_line = match.group('first')
second_line = match.group('second')
rest_lines = match.group('rest')
If I understand correctly, you want to split a large string into lines
lines = input_string.splitlines()
After that, you want to assign the first and second line to variables and the rest to another variable
title = lines[0]
description = lines[1]
rest = lines[2:]
If you want 'rest' to be a string, you can achieve that by joining it with a newline character.
rest = '\n'.join(lines[2:])
A different, very fast option is:
lines = input_string.split('\n', maxsplit=2)  # this splits only at the first two newlines
title = lines[0]
description = lines[1]
rest = lines[2]
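A quick check against the example text from the question (assuming it lives in input_string):
input_string = """Title of a string
Description of the string
There are multiple lines under the second line in the string!
There are multiple lines under the second line in the string!
There are multiple lines under the second line in the string!"""
title, description, rest = input_string.split('\n', maxsplit=2)
print(title)        # Title of a string
print(description)  # Description of the string
print(rest)         # the remaining three lines, still joined by newlines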

Perl regex.. match words exactly 2 times...Input is a JSON file

I am a beginner with any sort of regex. I need your help/pointers in resolving an issue. I have a JSON file which looks like the one below.
JSON format
{"record-type":"int-stats","time":1389309548046925,"host-id":"a.b.c.d","port":"ab-0/0/44","latency":108992}
{"record-type":"int-stats","time":1389309548046925,"host-id":"x.x.x.x","port":"ab-0/0/45","latency":36940}
{"record-type":"int-stats","time":1389309548046925,"host-id":"x.x.x.x","port":"ab-0/0/46","latency":11315}
{"record-type":"int-stats","time":1389309548046925,"host-id":"x.x.x.x","port":"ab-0/0/47","latency":102668}
{"record-type":"int-stats","time":1389309548046925,"host-id":"x.x.x.x","port":"ab-0/0/9","latency":347776}
{"record-type":"int-stats","time":1389309548041555,"host-id":"a.b.c.d","port":"ab-0/0/44","latency":108992}
{"record-type":"int-stats","time":1389309548041554,"host-id":"x.x.x.x","port":"ab-0/0/45","latency":36940}
{"record-type":"int-stats","time":1389309548046151,"host-id":"x.x.x.x","port":"ab-0/0/46","latency":11315}
{"record-type":"int-stats","time":1389309548041667,"host-id":"x.x.x.x","port":"ab-0/0/47","latency":102668}
{"record-type":"int-stats","time":1389309548042626,"host-id":"x.x.x.x","port":"ab-0/0/9","latency":347776}
{"record-type":"int-stats","time":1389309548035666,"host-id":"a.b.c.d","port":"ab-0/0/44","latency":108992}
{"record-type":"int-stats","time":1389309548035635,"host-id":"x.x.x.x","port":"ab-0/0/45","latency":36940}
{"record-type":"int-stats","time":1389309548042255,"host-id":"x.x.x.x","port":"ab-0/0/46","latency":11315}
{"record-type":"int-stats","time":1389309548041715,"host-id":"x.x.x.x","port":"ab-0/0/47","latency":102668}
{"record-type":"int-stats","time":1389309548046161,"host-id":"x.x.x.x","port":"ab-0/0/9","latency":347776}
{"record-type":"int-stats","time":1389309548023422,"host-id":"a.b.c.d","port":"ab-0/0/44","latency":108992}
{"record-type":"int-stats","time":1389309548041617,"host-id":"x.x.x.x","port":"ab-0/0/45","latency":36940}
{"record-type":"int-stats","time":1389309548046676,"host-id":"x.x.x.x","port":"ab-0/0/46","latency":11315}
{"record-type":"int-stats","time":1389309548045675,"host-id":"x.x.x.x","port":"ab-0/0/47","latency":102668}
{"record-type":"int-stats","time":1389309548046172,"host-id":"x.x.x.x","port":"ab-0/0/9","latency":347776}
{"record-type":"int-stats","time":1389309548034534,"host-id":"a.b.c.d","port":"ab-0/0/44","latency":108992}
{"record-type":"int-stats","time":1389309548012345,"host-id":"x.x.x.x","port":"ab-0/0/45","latency":36940}
{"record-type":"int-stats","time":1389309548025232,"host-id":"x.x.x.x","port":"ab-0/0/46","latency":11315}
{"record-type":"int-stats","time":1389309548023423,"host-id":"x.x.x.x","port":"ab-0/0/47","latency":102668}
{"record-type":"int-stats","time":1389309548252352,"host-id":"x.x.x.x","port":"ab-0/0/9","latency":347776}
I need to extract "port":"ab-0/0/44" and the "time" associated with that port. I am trying to calculate the time difference between any two such occurrences, i.e. 1st occurrence -> "time":1389309548046925 "port":"ab-0/0/44", 2nd occurrence -> "time":1389309548041555 "port":"ab-0/0/44". The calculated time difference must be stored in a variable. I tried a regular expression like this: /\"time\":\\d+\.*\"port\":\".b-0\/0\/44\"/. Any help is appreciated. Thanks in advance!
Use the JSON module. It's rather simple.
use strict;
use warnings;
use JSON;
while (<>) {
    /\S/ or next;
    my $data = decode_json($_);
    print "port -> $data->{port}\n";
    print "time -> $data->{time}\n";
}
With your data, I get output like this:
port -> ab-0/0/44
time -> 1389309548046925
port -> ab-0/0/45
time -> 1389309548046925
... etc
I'm not sure how you want to calculate your time difference, but I assume that doing the arithmetic is something you can figure out best on your own.
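For what it's worth, here is that bookkeeping sketched in Python (an illustration only; the same logic maps one-to-one onto the Perl loop above: remember the last time seen per port and subtract. 'records.json' is a hypothetical file holding the lines above):
import json
prev_time = {}  # last "time" seen for each port
with open('records.json') as fh:
    for line in fh:
        if not line.strip():
            continue
        data = json.loads(line)
        port, time = data['port'], data['time']
        if port in prev_time:
            diff = prev_time[port] - time  # 1st occurrence minus 2nd, kept in a variable
            print(f"{port}: {diff}")
        prev_time[port] = time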