Shell script to extract specific data from a file - regex

Given a text file containing records of the following form:
.....
feGroup1Person1 Person ::= {
id 1011,
uniquename "name1",
data 40,
moredata 100
}
feGroup1Person2 Person ::= {
id 5223,
uniquename "name2",
data 40,
moredata 200
}
.......
In a shell script, how could I go about extracting the Group and Person IDs for a particular uniquename?
For Example: Given "name2", I want to extract "feGroup1Person2".
I'm assuming some regular expressions will be required, but I'm not having any luck with it.
Any help appreciated

> awk '$0~/Person ::= \{/{x=$1; print x}' file
feGroup1Person1
feGroup1Person2
>
If you just want the group/person ID for one name, you can use the following.
For example, if you want the ID of the person whose uniquename is "name2":
awk '/name2/{print x1}{x1=x; x=$1}' file
feGroup1Person2
And if the name is "name1":
awk '/name1/{print x1}{x1=x; x=$1}' file
feGroup1Person1

You don't want to use shell scripting for this. You need to use something like Perl, VBScript, PowerShell or one of the many other more sophisticated scripting languages.
Which you use will depend primarily on your platform. On Windows try VBScript as a first choice. On Linux, try Perl first.

Don't attempt to formulate a solution entirely in terms of regular expressions. Your problem is sufficiently complex that regexes alone are not a wise choice of tool.
With a bit of manipulation, you could make this look like data in the JSON format and then parse it using a JSON parser. Any decent programming language (Python, Perl, Ruby...) should come with a JSON parser.
{
  "feGroup1Person1" : {
    "id" : 1011,
    "uniquename" : "name1",
    "data" : 40,
    "moredata" : 100
  },
  "feGroup1Person2" : {
    "id" : 5223,
    "uniquename" : "name2",
    "data" : 40,
    "moredata" : 200
  }
}
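As a rough illustration of that idea, here is a minimal Python sketch (the file name records.txt is hypothetical, and it assumes the file contains only records of exactly the shape shown above): it rewrites the records into JSON with a few substitutions, parses the result, and looks up the record name for a given uniquename.

import json
import re

with open("records.txt") as fh:          # hypothetical path to the data
    text = fh.read()

# 'feGroup1Person1 Person ::= {'  ->  '"feGroup1Person1": {'
text = re.sub(r'^(\w+) Person ::= \{', r'"\1": {', text, flags=re.M)
# 'id 1011,'  ->  '"id": 1011,'  (quotes every field name)
text = re.sub(r'^(\w+) (.*)$', r'"\1": \2', text, flags=re.M)
# add the comma JSON requires between consecutive records
text = re.sub(r'\}\n(?=")', '},\n', text)

records = json.loads("{" + text + "}")

# find the record whose uniquename is "name2"
for name, fields in records.items():
    if fields["uniquename"] == "name2":
        print(name)                      # -> feGroup1Person2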

Related

Spark Regex extraction

I have a string (shown below) stored in a table. The content inside messageBody is a JSON string. How can I read it with Spark and extract the JSON inside messageBody?
Input Data:
{
"audit_id": "",
"audit_name": "GFSVpFeox/KrjEpFkIELEgltPGcqVU7/I0Oh9iVfdWA=",
"audit_info": "ingest-eventss",
"messageBody": "{\"Id\":\"8607379a-348b-4fdd-909e-80b85ac402d1\",\"EventId\":31,\"EventName\":\"LandingPage\",\"TriggerId\":38,\"TriggerName\":\"Agent.StartInterview\",\"TopicId\":5,\"TopicName\":\"businessevents.data\",\"SourceAppId\":22,\"SourceAppName\":\"TEST\",\"EventCorrelationId\":\"e3f091d9-86cf-4516-a173-22d891e1f20a\",\"Environment\":\"en1\",\"Timestamp\":\"2022-04-15T20:11:48.9505708Z\",\"Detail\":{\"Data\":{\"LineContent\":\"Business\",\"ReferenceNumber\":\"6834555\"}}},
"partitionKey": null,
"replyTo": null
}
Expected Output:
audit_info: ingest-eventss
messageBody: {"Id":"8607379a-348b-4fdd-909e-80b85ac402d1","EventId":31,"EventName":"LandingPage","TriggerId":38,"TriggerName":"Agent.StartInterview","TopicId":5,"TopicName":"businessevents.data","SourceAppId":22,"SourceAppName":"TEST","EventCorrelationId":"e3f091d9-86cf-4516-a173-22d891e1f20a","Environment":"en1","Timestamp":"2022-04-15T20:11:48.9505708Z","Detail":{"Data":{"LineContent":"Business","ReferenceNumber":"6834555"}}}
I need to do this in Spark 3. Is there a regexp_extract or split function I can use? Splitting seems hard, since the delimiter : also appears inside the JSON message.
You could use the find/replace below, but it's not recommended because it's not maintainable.
find:
{([^\\]*)"{\\"Id\\":\\"([^\\]*)\\",\\"EventId\\":([^,]*),\\"EventName\\":\\"([^\\]*)\\",\\"TriggerId\\":([^,]*),\\"TriggerName\\":\\"([^\\]*)\\",\\"TopicId\\":([^,]*),\\"TopicName\\":\\"([^\\]*)\\",\\"SourceAppId\\":([^,]*),\\"SourceAppName\\":\\"([^\\]*)\\",\\"EventCorrelationId\\":\\"([^\\]*)\\",\\"Environment\\":\\"([^\\]*)\\",\\"Timestamp\\":\\"([^\\]*)\\",\\"Detail\\":{\\"([^\\]*)\\":{\\"LineContent\\":\\"([^\\]*)\\",\\"ReferenceNumber\\":\\"([^\\]*)\\"}}}([^}]*)}
replace:
{"Id":"$2","EventId":$3,"EventName":"$4","TriggerId":$5,"TriggerName":"$6","TopicId":$7,"TopicName":"$8","SourceAppId":$9,"SourceAppName":"($10)","EventCorrelationId":"$11","Environment":"$12","Timestamp":"$13","Detail":{"$14":{"LineContent":"$15","ReferenceNumber":"$16"}}}
I'm not familiar with Spark syntax, but regexp_replace(string A, string B, string C) looks like the right tool: it replaces the parts of string A that match the Java regular expression B with C.
Why not use a Python or Java dict instead of concatenating strings?
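If you can avoid the regex entirely, Spark's built-in JSON functions are easier to maintain. Below is a rough PySpark sketch, not a tested answer: the column name value and the shortened sample record are hypothetical, and get_json_object pulls fields out of a JSON string by path, so the : characters inside the message don't matter.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical: one row whose "value" column holds the whole record as a string
# (shortened here for readability).
raw_record = (
    '{"audit_info": "ingest-eventss", '
    '"messageBody": "{\\"EventId\\":31,\\"EventName\\":\\"LandingPage\\"}", '
    '"partitionKey": null}'
)
df = spark.createDataFrame([(raw_record,)], ["value"])

extracted = df.select(
    F.get_json_object("value", "$.audit_info").alias("audit_info"),
    F.get_json_object("value", "$.messageBody").alias("messageBody"),
)
extracted.show(truncate=False)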

Fluentd Parsing

Hi, I'm trying to parse a single-line log using Fluentd. Here is the log I'm trying to parse:
F2:4200000000000000,F3:000000,F4:000000060000,F6:000000000000,F7:000000000,F8..........etc
This should be parsed into something like this:
{ "F2" : "4200000000000000", "F3" : "000000", "F4" : "000000060000" ............etc }
I tried to use a regex, but it's confusing and forces me to write multiple regexes for the different keys and values. Is there an easier way to achieve this?
EDIT1: I'll make this more detailed. I'm currently tailing logs with Fluentd into Elasticsearch+Kibana. Here is an unparsed example log that Fluentd sends to Elasticsearch:
21/09/02 16:36:09.927238: 1 frSMS:0:13995:#HTF4J::141:141:msg0210,00000000000000000,000000,000000,007232,00,#,F2:00000000000000000,F3:002000,F4:000000820000,F6:Random message and strings,F7:.......etc
Message received by Elasticsearch:
{"message":"frSMS:0:13995:#HTF4J::141:141:msg0210,00000000000000000,000000,000000,007232,00,#,F2:00000000000000000,F3:002000,F4:000000820000,F6:Random digits and chars,F7:.......etc"}
This log has only the message key, so I can't index it or build dashboards on anything but the whole message field. What I'm trying to achieve is to capture only the useful fields, add a key to any value that has none, and make indexing easier.
Expected output:
{"logdate" : "21/09/02 16:36:09.927238",
"source" : "frSMS",
"UID" : "#HTF4J",
"statuscode" : "msg0210",
"F2": "00000000000000000",
"F3": "randomchar314516",.....}
I used the regexp parser plugin to do this, but it got overwhelming. Here is what I have so far:
^(?<logDate>\d{2}.\d{2}.\d{2}\s\d{2}:\d{2}:\d{2}.\d{6}\b)....(?<source>fr[A-Z]{3,4}|to[A-Z]{3,4}\b).(?<status>\d\b).(?<dummyfield>\d{5}\b).(?<HUID>.[A-Z]{5}\b)..(?<d1>\d{3}\b).(?<d2>\d{3}\b).(?<msgcode>msg\d{4}\b).(?<dummyfield1>\d{16}\b).(?<dummyfield2>\d{6}\b).(?<dummyfield3>\d{6,7}\b).(?<dummyfield4>\d{6}\b).(?<dummyfield5>\d{2}\b)...
Which results in:
"logDate": "21/09/02 16:36:09.205706",
"source": "toSMS" ,
"status": "0",
"dummyfield": "13995" ,
"UID" : "#HTFAA" ,
"d1" : "156" ,
"d2" : "156" ,
"msgcode" : "msg0210",
"dummyfield1" :"0000000000000000" ,
"dummyfield2" :"002000",
"dummyfield3" :"2000000",
"dummyfield4" :"00",
"dummyfield5" :"2000000" ,
"dummyfield6" :"867202"
This only applies to that one example log, and it has useless fields like field1, dummyfield, dummyfield1, etc.
Other logs have the useful keys and values (date, source, msgcode, UID, F1, F2 fields) as I showed in the expected output. The not-useful fields are not static (they can be absent, or have more or fewer digits and characters), so they trigger a "pattern not matched" error.
So the questions are:
How do I capture the useful fields I mentioned using regex?
How do I capture the F1, F2, F3, ... fields whose values have different patterns, like mixed characters and strings?
PS: I wrapped the regex I wrote in an HTML snippet so the <> capture groups don't get deleted.
Regex pattern to use:
(F[\d]+):([\d]+)
This pattern will catch all the 'F' keys with whatever digits come after them; even if it's F105, it still works. The whole 'F105' will be stored as the first group of each regex match.
The right part of the pattern will capture the digits that follow ':' up to any character that is not a digit (',', 'F', etc.) and store them as the second group of the match.
Use
Depending on your language, you will need to iterate over the matches and extract group 1 and group 2 respectively.
Python example:
import re

log = 'F2:4200000000000000,F3:000000,F4:000000060000,F6:000000000000,F7:000000000,F105:9726450'
pattern = r'(F[\d]+):([\d]+)'
matches = re.finditer(pattern, log)
log_dict = {}
for match in matches:
    log_dict[match.group(1)] = match.group(2)
print(log_dict)
Output
{'F2': '4200000000000000', 'F3': '000000', 'F4': '000000060000', 'F6': '000000000000', 'F7': '000000000', 'F105': '9726450'}
Assuming the logdate is static (pattern-wise), you can skip the useless values with ".+" and collect the useful values by their own patterns. So the regex will look like this:
(?<logdate>\d{2}.\d{2}.\d{2}\s\d{2}:\d{2}:\d{2}.\d{6}\b).+(?<source>fr[A-Z]{3,4}|to[A-Z]{3,4}).+(?<UID>#[A-Z0-9]{5}).+(?<statuscode>msg\d{4})
And the output will look like:
{"logdate" : "21/09/02 16:36:09.927238", "source" : "frSMS",
"UID" : "#HTF4J","statuscode" : "msg0210"}
And I'm working on getting F2,F3,FN keys and values.
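For what it's worth, here is a rough Python sketch of how the two patterns could be combined outside Fluentd (the shortened sample log is hypothetical): the named groups grab the fixed header fields, and a second pass picks up however many F-fields happen to be present.

import re

# Hypothetical sample shaped like the log in the question (shortened).
log = ("21/09/02 16:36:09.927238: 1 frSMS:0:13995:#HTF4J::141:141:msg0210,"
       "00000000000000000,000000,000000,007232,00,#,"
       "F2:00000000000000000,F3:002000,F4:000000820000")

# Fixed header fields; the lazy ".+?" skips the variable junk in between.
header = re.search(
    r'(?P<logdate>\d{2}/\d{2}/\d{2}\s\d{2}:\d{2}:\d{2}\.\d{6})'
    r'.+?(?P<source>fr[A-Z]{3,4}|to[A-Z]{3,4})'
    r'.+?(?P<UID>#[A-Z0-9]{5})'
    r'.+?(?P<statuscode>msg\d{4})',
    log,
)
record = header.groupdict() if header else {}

# Every F<number>:<digits> pair, however many there are.
record.update(dict(re.findall(r'(F\d+):(\d+)', log)))
print(record)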

How to find the day of the week from timestamp

I have a timestamp, 2015-11-01 21:45:25,296, like the one mentioned above. Is it possible to extract the day of the week (Mon, Tue, etc.) using a regular expression or grok pattern?
Thanks in advance
This is quite easy if you're willing to use the ruby filter. I'm lazy, so that's all I'm doing here.
Here is my filter:
filter {
  ruby {
    code => "
      p = Time.parse(event['message']);
      event['day-of-week'] = p.strftime('%A');
    "
  }
}
The 'message' field is the one that contains your timestamp.
With stdin and stdout and your string, you get:
artur#pandaadb:~/dev/logstash$ ./logstash-2.3.2/bin/logstash -f conf2/
Settings: Default pipeline workers: 8
Pipeline main started
2015-11-01 21:45:25,296
{
"message" => "2015-11-01 21:45:25,296",
"#version" => "1",
"#timestamp" => "2016-08-03T13:07:31.377Z",
"host" => "pandaadb",
"day-of-week" => "Sunday"
}
Hope that is what you need,
Artur
What you want is the following, assuming your string is 2015-11-01 21:45:25,296:
mydate='2015-11-01 21:45:25,296'
date +%a -d "${mydate% *}"
This strips the time-of-day part and gives you what you want.
Short answer: no, you can't.
A regex, according to Wikipedia:
...is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace"-like operations.
So a regex allows you to parse a string, searching for information within it, but it doesn't perform calculations on it.
If you want to make such calculations you need help from a programming language (Java, C#, Ruby [as #pandaadb suggested], etc.) or some other tool that makes those calculations (e.g. Epoch Converter).
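For completeness, outside Logstash the same calculation is a few lines in most languages; here is a small Python sketch (the timestamp literal is just the one from the question):

from datetime import datetime

ts = "2015-11-01 21:45:25,296"
parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f")
print(parsed.strftime("%A"))  # Sunday
print(parsed.strftime("%a"))  # Sun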

Using Regex in Pig in hadoop

I have a CSV file containing tweets (tweetid, tweet, userid).
396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse",Obey_Jony09
396124436740317184,"“#BleacherReport: Halloween has given us this amazing Derrick Rose photo (via #amandakaschube, #ScottStrazzante) http://t.co/tM0wEugZR1” yes",Colten_stamkos
396124436845178880,"When's 12.4k gonna roll around",Matty_T_03
Now I need to write a Pig Query that returns all the tweets that include the word 'favorite', ordered by tweet id.
For this I have the following code:
A = load '/user/pig/tweets' as (line);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)[,”:-](.*)[“,:-](.*)')) AS (tweetid:long,msg:chararray,userid:chararray);
C = filter B by msg matches '.*favorite.*';
D = order C by tweetid;
How does the regular expression here work to split the line into the desired fields?
I tried using REGEX_EXTRACT instead of REGEX_EXTRACT_ALL, as I find it much simpler, but I couldn't get the code working except for extracting just the tweets:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'[,”:-](.*)[“,:-]',1)) AS (msg:chararray);
The above alias gets me the tweets, but when I use REGEX_EXTRACT to get the tweet id, I don't get the desired output: B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'(.*)[,”:-]',1)) AS (tweetid:long);
(396124554353197056,"Just saw #samantha0wen and #DakotaFears at the drake concert #waddup")
(396124554172432384,"#Yutika_Diwadkar I'm just so bright 😁")
(396124554609033216,"#TB23GMODE i don't know, i'm just saying, why you in GA though? that's where you from?")
(396124554805776385,"#MichaelThe_Lion me too 😒")
(396124552540852226,"Happy Halloween from us 2 #maddow & #Rev_AlSharpton :) http://t.co/uC35lDFQYn")
grunt>
Please help.
I can't comment, but from looking at this and testing it out, it looks like the quotes in your regex are different from those in the CSV: the CSV uses " while the regex uses ”.
To get the tweetid, try anchoring on the leading digits instead:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'^([0-9]+),.*',1)) AS (tweetid:long);
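Not Pig, but a quick way to sanity-check a tighter pattern is to try it in Python first. The sketch below anchors the three fields explicitly (the sample line is taken from the question); since it matches the whole line, the same expression should also work in the REGEX_EXTRACT_ALL call above.

import re

line = '396124436845178880,"When\'s 12.4k gonna roll around",Matty_T_03'

# tweetid = leading digits, msg = everything inside the outer quotes, userid = the rest
m = re.match(r'^(\d+),"(.*)",([^,]+)$', line)
if m:
    tweetid, msg, userid = m.groups()
    print(tweetid, userid)   # 396124436845178880 Matty_T_03
    print(msg)               # When's 12.4k gonna roll around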

Extract value from file using Python

I have a file like this:
"keyName":"type","start":{"row":42,"column":0},"end":{"row":42,"column":3},
"keyName":"left","start":{"row":42,"column":0},"end":{"row":42,"column":3},
I need to extract all the keyName values, like "keyName":"type" and "keyName":"left", excluding the other values, using Python.
This is a very simple solution that just parses it as a text file. I don't know exactly what you need to do with it, but it may be efficient enough!
text = open("pathToMyFile", 'r').read()
split = text.split(",")
for x in split:
    if "keyName" in x:
        print(x)
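If you want just the values rather than the whole "keyName":"..." chunks, a regex version is barely longer (same hypothetical file path):

import re

text = open("pathToMyFile", 'r').read()
# capture only the value that follows each "keyName" key
names = re.findall(r'"keyName":"([^"]+)"', text)
print(names)  # ['type', 'left']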