Spark Regex extraction

I have the string below in a table. The content of messageBody is a JSON string. How can I read this with Spark and extract the JSON inside messageBody?
Input Data:
{
"audit_id": "",
"audit_name": "GFSVpFeox/KrjEpFkIELEgltPGcqVU7/I0Oh9iVfdWA=",
"audit_info": "ingest-eventss",
"messageBody": "{\"Id\":\"8607379a-348b-4fdd-909e-80b85ac402d1\",\"EventId\":31,\"EventName\":\"LandingPage\",\"TriggerId\":38,\"TriggerName\":\"Agent.StartInterview\",\"TopicId\":5,\"TopicName\":\"businessevents.data\",\"SourceAppId\":22,\"SourceAppName\":\"TEST\",\"EventCorrelationId\":\"e3f091d9-86cf-4516-a173-22d891e1f20a\",\"Environment\":\"en1\",\"Timestamp\":\"2022-04-15T20:11:48.9505708Z\",\"Detail\":{\"Data\":{\"LineContent\":\"Business\",\"ReferenceNumber\":\"6834555\"}}},
"partitionKey": null,
"replyTo": null
}
Expected Output:
audit_info: ingest-eventss
messageBody: {"Id":"8607379a-348b-4fdd-909e-80b85ac402d1","EventId":31,"EventName":"LandingPage","TriggerId":38,"TriggerName":"Agent.StartInterview","TopicId":5,"TopicName":"businessevents.data","SourceAppId":22,"SourceAppName":"TEST","EventCorrelationId":"e3f091d9-86cf-4516-a173-22d891e1f20a","Environment":"en1","Timestamp":"2022-04-15T20:11:48.9505708Z","Detail":{"Data":{"LineContent":"Business","ReferenceNumber":"6834555"}}}
I need to do this in Spark 3. Is there a regexp_extract or split function that works here? split seems hard because the delimiter : also appears inside the JSON message.

You might need this, but it's not recommended because it's not maintainable
find:
{([^\\]*)"{\\"Id\\":\\"([^\\]*)\\",\\"EventId\\":([^,]*),\\"EventName\\":\\"([^\\]*)\\",\\"TriggerId\\":([^,]*),\\"TriggerName\\":\\"([^\\]*)\\",\\"TopicId\\":([^,]*),\\"TopicName\\":\\"([^\\]*)\\",\\"SourceAppId\\":([^,]*),\\"SourceAppName\\":\\"([^\\]*)\\",\\"EventCorrelationId\\":\\"([^\\]*)\\",\\"Environment\\":\\"([^\\]*)\\",\\"Timestamp\\":\\"([^\\]*)\\",\\"Detail\\":{\\"([^\\]*)\\":{\\"LineContent\\":\\"([^\\]*)\\",\\"ReferenceNumber\\":\\"([^\\]*)\\"}}}([^}]*)}
replace:
{"Id":"$2","EventId":$3,"EventName":"$4","TriggerId":$5,"TriggerName":"$6","TopicId":$7,"TopicName":"$8","SourceAppId":$9,"SourceAppName":"($10)","EventCorrelationId":"$11","Environment":"$12","Timestamp":"$13","Detail":{"$14":{"LineContent":"$15","ReferenceNumber":"$16"}}}
I'm not familiar with Spark syntax, but maybe something like regexp_replace(string A, string B, string C) would apply it: it replaces the parts of string A that match the Java regular expression B with C.
Why not use a Python or Java dict to concatenate the strings?
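Alternatively, since the outer record is itself valid JSON, Spark's get_json_object can pull messageBody out without any hand-written regex. A minimal PySpark sketch, using a shortened stand-in row and an assumed column name raw:

from pyspark.sql import SparkSession
from pyspark.sql.functions import get_json_object

spark = SparkSession.builder.getOrCreate()

# Hypothetical one-row DataFrame standing in for the real table;
# the messageBody payload is shortened here for readability.
row = ('{"audit_id":"","audit_info":"ingest-eventss",'
       '"messageBody":"{\\"Id\\":\\"8607379a-348b-4fdd-909e-80b85ac402d1\\",\\"EventId\\":31}",'
       '"partitionKey":null}')
df = spark.createDataFrame([(row,)], ["raw"])

df.select(
    get_json_object("raw", "$.audit_info").alias("audit_info"),
    # get_json_object returns the nested object as a plain, unescaped
    # string, which is exactly the expected output here.
    get_json_object("raw", "$.messageBody").alias("messageBody"),
).show(truncate=False)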

Related

Fluentd Parsing

Hi, I'm trying to parse a single-line log using Fluentd. Here is the log I'm trying to parse:
F2:4200000000000000,F3:000000,F4:000000060000,F6:000000000000,F7:000000000,F8..........etc
It should parse into something like this:
{ "F2" : "4200000000000000", "F3" : "000000", "F4" : "000000060000" ............etc }
I tried to use regex, but it's confusing and forces me to write multiple regexes for different keys and values. Is there an easier way to achieve this?
EDIT1: Heya! I'll make this more detailed. I'm currently tailing logs with Fluentd into Elasticsearch+Kibana. Here is an unparsed example log that Fluentd sends to Elasticsearch:
21/09/02 16:36:09.927238: 1 frSMS:0:13995:#HTF4J::141:141:msg0210,00000000000000000,000000,000000,007232,00,#,F2:00000000000000000,F3:002000,F4:000000820000,F6:Random message and strings,F7:.......etc
Elasticsearch received message:
{"message":"frSMS:0:13995:#HTF4J::141:141:msg0210,00000000000000000,000000,000000,007232,00,#,F2:00000000000000000,F3:002000,F4:000000820000,F6:Random digits and chars,F7:.......etc"}
This log has only a message key, so I can't index and build dashboards on the whole message field alone. What I'm trying to achieve is to capture only the useful fields, add a key where a value has none, and make indexing easier.
Expected output:
{"logdate" : "21/09/02 16:36:09.927238",
"source" : "frSMS",
"UID" : "#HTF4J",
"statuscode" : "msg0210",
"F2": "00000000000000000",
"F3": "randomchar314516",.....}
I used the regexp parser plugin to get this far, but it was too overwhelming. Here is what I have so far:
^(?<logDate>\d{2}.\d{2}.\d{2}\s\d{2}:\d{2}:\d{2}.\d{6}\b)....(?<source>fr[A-Z]{3,4}|to[A-Z]{3,4}\b).(?<status>\d\b).(?<dummyfield>\d{5}\b).(?<HUID>.[A-Z]{5}\b)..(?<d1>\d{3}\b).(?<d2>\d{3}\b).(?<msgcode>msg\d{4}\b).(?<dummyfield1>\d{16}\b).(?<dummyfield2>\d{6}\b).(?<dummyfield3>\d{6,7}\b).(?<dummyfield4>\d{6}\b).(?<dummyfield5>\d{2}\b)...
Which results in:
"logDate": "21/09/02 16:36:09.205706",
"source": "toSMS" ,
"status": "0",
"dummyfield": "13995" ,
"UID" : "#HTFAA" ,
"d1" : "156" ,
"d2" : "156" ,
"msgcode" : "msg0210",
"dummyfield1" :"0000000000000000" ,
"dummyfield2" :"002000",
"dummyfield3" :"2000000",
"dummyfield4" :"00",
"dummyfield5" :"2000000" ,
"dummyfield6" :"867202"
This only applies to the example log and has useless fields like field1, dummyfield, dummyfield1, etc.
Other logs have the useful values and keys (date, source, msgcode, UID, F1, F2 fields) as I showcased in the expected output. The not-useful fields are not static (they can be absent, or have fewer or more digits and characters), so they trigger the "pattern not matched" error.
So the questions are:
How do I capture the useful fields I mentioned using regex?
How do I capture the F1, F2, F3... fields that have different value patterns, like mixed character strings?
PS: I wrapped the regex I wrote in an HTML snippet so the <> capturing fields don't get deleted.
Regex pattern to use:
(F[\d]+):([\d]+)
This pattern will catch all the 'F' keys with whatever digits come after; even F105 still works. The whole 'F105' is stored as the first group of each regex match.
The right part of the pattern catches the digits following ':' up to the first character that is not a digit (',', 'F', etc.) and stores them as the second group of the match.
Use:
Depending on your language, you iterate over the matches and extract group 1 and group 2 respectively.
Python example:
import re

log = 'F2:4200000000000000,F3:000000,F4:000000060000,F6:000000000000,F7:000000000,F105:9726450'
pattern = r'(F[\d]+):([\d]+)'
matches = re.finditer(pattern, log)
log_dict = {}
for match in matches:
    log_dict[match.group(1)] = match.group(2)
print(log_dict)
Output
{'F2': '4200000000000000', 'F3': '000000', 'F4': '000000060000', 'F6': '000000000000', 'F7': '000000000', 'F105': '9726450'}
Assuming the logdate is static (pattern-wise), you can ignore the useless values with ".+" and collect the useful values by their patterns. So the regex will look like this:
(?<logdate>\d{2}.\d{2}.\d{2}\s\d{2}:\d{2}:\d{2}.\d{6}\b).+(?<source>fr[A-Z]{3,4}|to[A-Z]{3,4}).+(?<UID>#[A-Z0-9]{5}).+(?<statuscode>msg\d{4})
And the output will be like:
{"logdate" : "21/09/02 16:36:09.927238", "source" : "frSMS",
"UID" : "#HTF4J","statuscode" : "msg0210"}
And I'm working on getting F2,F3,FN keys and values.
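Putting the two answers together, here is a minimal Python sketch (note that Python spells named groups (?P<name>...), while Fluentd/Ruby uses (?<name>...)); the sample line is the one from the question:

import re

log = ('21/09/02 16:36:09.927238: 1 frSMS:0:13995:#HTF4J::141:141:msg0210,'
       '00000000000000000,000000,000000,007232,00,#,'
       'F2:00000000000000000,F3:002000,F4:000000820000')

# Fixed header fields via named groups.
header = re.search(
    r'(?P<logdate>\d{2}.\d{2}.\d{2}\s\d{2}:\d{2}:\d{2}.\d{6}\b).+'
    r'(?P<source>fr[A-Z]{3,4}|to[A-Z]{3,4}).+'
    r'(?P<UID>#[A-Z0-9]{5}).+'
    r'(?P<statuscode>msg\d{4})',
    log)

record = header.groupdict() if header else {}
# Variable F<n>:<digits> pairs via the pattern from the first answer.
record.update(dict(re.findall(r'(F\d+):(\d+)', log)))
print(record)
# {'logdate': '21/09/02 16:36:09.927238', 'source': 'frSMS', 'UID': '#HTF4J',
#  'statuscode': 'msg0210', 'F2': '00000000000000000', 'F3': '002000',
#  'F4': '000000820000'}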

How can I strip a href attribute without the query?

Using Google Sheets, I'd like to grab a URL, without any query string, from an href attribute. For example, get https://test.com from an anchor like <a href="https://test.com?foo=bar">Test1</a> or <a href="https://test.com">Test1</a>.
I've used the regex answer offered in https://stackoverflow.com/a/40426187/4829915 to remove the query string, and then extracted the actual URL.
Is there a way to do it in one formula?
Please see below what I did. In all of these examples the final output is https://test.com
     A        B                             C
1             \?[^\"]+                      href="(.+)"
2    Test1    =REGEXREPLACE(A2, B$1, "")    =REGEXEXTRACT(B2, C$1)
3    Test2    =REGEXREPLACE(A3, B$1, "")    =REGEXEXTRACT(B3, C$1)
4    Test3    =REGEXREPLACE(A4, B$1, "")    =REGEXEXTRACT(B4, C$1)
In this answer, I'd like to propose two patterns. The first uses REGEXEXTRACT; the second uses a custom function written with Google Apps Script (as a sample).
Pattern 1: Using formula
=REGEXEXTRACT(A2, C1)
where C1 is href="(.+?)[\?"]
Pattern 2: Using custom function
When you use this, please copy and paste the script into the script editor. Then use it in a cell, like =getUrl(A2).
function getUrl(value) {
  var obj = XmlService.parse(value.replace(/&/g, "&amp;"));
  var url = obj.getRootElement().getAttribute("href").getValue();
  return url.split("?")[0];
}
Results: both patterns return https://test.com for each of the sample values.
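As a sanity check outside Sheets, the same single-formula pattern behaves the same way in Python; the sample anchors below are assumptions standing in for the real cell contents:

import re

# Hypothetical anchors, one with a query string and one without.
samples = [
    '<a href="https://test.com?foo=bar">Test1</a>',
    '<a href="https://test.com">Test2</a>',
]
pattern = r'href="(.+?)[?"]'  # same pattern as cell C1 above
for s in samples:
    print(re.search(pattern, s).group(1))  # https://test.com both times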
References:
REGEXEXTRACT
XmlService

How to find the day of the week from timestamp

I have a timestamp, 2015-11-01 21:45:25,296, as mentioned above. Is it possible to extract the day of the week (Mon, Tue, etc.) using a regular expression or grok pattern?
Thanks in advance
This is quite easy if you want to use the ruby filter. I'm lazy, so that's all I'm doing here.
Here is my filter:
filter {
  ruby {
    code => "
      p = Time.parse(event['message']);
      event['day-of-week'] = p.strftime('%A');
    "
  }
}
The 'message' variable is the field that contains your timestamp
With stdin and stdout and your string, you get:
artur#pandaadb:~/dev/logstash$ ./logstash-2.3.2/bin/logstash -f conf2/
Settings: Default pipeline workers: 8
Pipeline main started
2015-11-01 21:45:25,296
{
        "message" => "2015-11-01 21:45:25,296",
       "@version" => "1",
     "@timestamp" => "2016-08-03T13:07:31.377Z",
           "host" => "pandaadb",
    "day-of-week" => "Sunday"
}
Hope that is what you need,
Artur
What you want is the following. Assuming your string is 2015-11-01 21:45:25,296:
mydate='2015-11-01 21:45:25,296'
date +%a -d "${mydate% *}"
This will give you what you want; ${mydate% *} strips everything from the last space onward, leaving just the date.
Short answer: no, you can't.
A regex, according to Wikipedia:
...is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace"-like operations.
So, a regex allows you to parse a string, searching for information within it, but it doesn't perform calculations over it.
If you want to make such calculations you need help from a programming language (Java, C#, or Ruby [like @pandaadb suggested], etc.) or some other tool that makes those calculations (e.g. Epoch Converter).
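For instance, a minimal Python sketch of that calculation:

from datetime import datetime

ts = "2015-11-01 21:45:25,296"
# %f accepts the 3-digit millisecond part after the comma.
dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f")
print(dt.strftime("%A"))  # Sunday
print(dt.strftime("%a"))  # Sun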

Extract value from file using Python

I have a file like:
"keyName":"type","start":{"row":42,"column":0},"end":{"row":42,"column":3},
"keyName":"left","start":{"row":42,"column":0},"end":{"row":42,"column":3},
I need to extract all the keyName values, like "keyName":"type" and "keyName":"left", excluding the other values, using Python.
This is a very simple solution that just treats it as a text file.
I don't know what you ultimately have to do, but it can be efficient enough!
text = open("pathToMyFile", 'r').read()
split = text.split(",")
for x in split:
    if "keyName" in x:
        print(x)
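If you'd rather not rely on the comma layout, a regex performs the same extraction; a minimal sketch using the same placeholder path:

import re

text = open("pathToMyFile", "r").read()
# Grab each "keyName":"<value>" pair, ignoring everything else.
print(re.findall(r'"keyName":"[^"]*"', text))
# ['"keyName":"type"', '"keyName":"left"']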

Shell script to extract specific data from a file

Given a text file containing records of the following form:
.....
feGroup1Person1 Person ::= {
id 1011,
uniquename "name1",
data 40,
moredata 100
}
feGroup1Person2 Person ::= {
id 5223,
uniquename "name2",
data 40,
moredata 200
}
.......
In a shell script, how could I go about extracting the Group and Person IDs for a particular uniquename?
For Example: Given "name2", I want to extract "feGroup1Person2".
I'm assuming some regular expressions will be required, but I'm not having any luck with it.
Any help appreciated
> awk '$0~/Person ::= \{/{x=$1; print x}' file
feGroup1Person1
feGroup1Person2
>
If you just want the group ID, you can use the following; the trailing block keeps a rolling window of the previous lines' first fields, so when the uniquename line matches, x1 holds the first field of the record header two lines earlier.
For example, if you want the group ID of the person whose uniquename is "name2":
awk '/name2/{print x1}{x1=x;x=$1}' file
feGroup1Person2
And if the name is "name1":
awk '/name1/{print x1}{x1=x;x=$1}' file
feGroup1Person1
You don't want to use shell scripting for this. You need to use something like Perl, VBScript, PowerShell or one of the many other more sophisticated scripting languages.
Which you use will depend primarily on your platform. On Windows try VBScript as a first choice. On Linux, try Perl first.
Don't attempt to formulate a solution entirely in terms of regular expressions. Your problem is sufficiently complex that regexes alone are not a wise choice of tool.
With a bit of manipulation, you could make this look like data in the JSON format and then parse it using a JSON parser. Any decent programming language (Python, Perl, Ruby...) should come with one:
{
  "feGroup1Person1": {
    "id": 1011,
    "uniquename": "name1",
    "data": 40,
    "moredata": 100
  },
  "feGroup1Person2": {
    "id": 5223,
    "uniquename": "name2",
    "data": 40,
    "moredata": 200
  }
}
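For instance, a minimal Python sketch of that manipulation, assuming the record layout shown above (the rewrite regexes are illustrative, not the only way to do it):

import re
import json

# Sample records inline; in practice, read them from the real file.
raw = '''feGroup1Person1 Person ::= {
id 1011,
uniquename "name1",
data 40,
moredata 100
}
feGroup1Person2 Person ::= {
id 5223,
uniquename "name2",
data 40,
moredata 200
}'''

# 'feGroup1Person1 Person ::= {' -> '"feGroup1Person1": {'
s = re.sub(r'(\w+)\s+Person\s+::=\s*\{', r'"\1": {', raw)
# 'id 1011,' -> '"id": 1011,' (quote field names, insert colons)
s = re.sub(r'^(\w+)\s+', r'"\1": ', s, flags=re.M)
# Separate the records with commas and wrap everything in braces.
s = "{" + re.sub(r'\}\s*(?=")', '},\n', s) + "}"

data = json.loads(s)
for name, rec in data.items():
    if rec["uniquename"] == "name2":
        print(name)  # feGroup1Person2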