Splunk search regex to filter timestamp and userId

I want to extract the timestamp along with the UserID from the line below and group by it:
2020-10-12 12:30:22.540 INFO 1 --- [enerContainer-4] c.t.t.o.s.s.UserPrepaidService : Validating the user with UserID:1111 systemID:sys111
That line appears within the full logs below:
2020-10-12 12:30:22.538 INFO 1 --- [ener-4] c.t.t.o.s.service.UserService : AccountDetails":[{"snumber":"2222","sdetails":[{"sId":"0474889018","sType":"Java","plan":[{"snumber":"sdds22"}]}]}]}
2020-10-12 12:30:22.538 INFO 1 --- [ener-4] c.t.t.o.s.service.ReceiverService : Received userType is:Normal
2020-10-12 12:30:22.540 INFO 1 --- [enerContainer-4] c.t.t.o.s.s.UserPrepaidService : Validating the user with UserID:1111 systemID:sys111
2020-10-12 12:30:22.540 INFO 1 --- [enerContainer-4] c.t.t.o.s.util.CommonUtil : The Code is valid for userId: 1111 systemId: sys111
2020-10-12 12:30:22.577 INFO 1 --- [enerContainer-4] c.t.t.o.s.r.Dao : Saving user into dB ..... with User-ID:1111
....
(the same lines repeat)
Below is my SPL search; it returns only the userId, grouped, from that specific line.
But I want the timestamp from that line as well, so I can group by it with a timechart.
index="tis" logGroup="/ecs/logsmy" "logEvents{}.message"="*Validating the user with UserID*" | spath output=myfield path=logEvents{}.message | rex field=myfield "(?<=Validating the user with UserID:)(?<userId>[0-9]+)(?= systemID:)" | table userId | dedup userId | stats count values(userId) by userId
Basically I tried the below:
(^(?<dtime>\d{4}-\d{1,2}-\d{1,2}\s+\d{1,2}:\d{1,2}:\d{1,2}\.\d+) )(?<=Validating the user with UserID:)(?<userId>[0-9]+)(?= systemID:)
but it gave the timestamps of all lines, not specifically the line I mentioned above.

You placed the lookbehind right after matching the timestamp pattern, but the regex engine first has to move to the position where the lookbehind is true.
If you want both values, you can match Validating the user with UserID: and systemID: instead of using a lookaround.
If there are leading whitespace chars, you could match them with \s* or [^\S\r\n]*
^\s*(?<dtime>\d{4}-\d{1,2}-\d{1,2}\s+\d{1,2}:\d{1,2}:\d{1,2}\.\d+).*\bValidating the user with UserID:(?<userId>[0-9]+) systemID:
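As a quick sanity check outside Splunk, here is the same pattern in Python (a sketch; note that Python spells named groups (?P<name>...), while Splunk's rex accepts the (?<name>...) syntax shown above):
import re

line = ("2020-10-12 12:30:22.540 INFO 1 --- [enerContainer-4] "
        "c.t.t.o.s.s.UserPrepaidService : Validating the user with UserID:1111 systemID:sys111")

pattern = (r"^\s*(?P<dtime>\d{4}-\d{1,2}-\d{1,2}\s+\d{1,2}:\d{1,2}:\d{1,2}\.\d+)"
           r".*\bValidating the user with UserID:(?P<userId>[0-9]+) systemID:")

m = re.search(pattern, line)
if m:
    print(m.group("dtime"), m.group("userId"))  # 2020-10-12 12:30:22.540 1111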


PowerShell script to concatenate regex and variable

I want to write a PowerShell script with a regex to audit several network devices' configuration files for compliance. Some devices are configured with one management VLAN while others have multiple management VLANs. Examples below:
Config1:
VLAN Name Status Ports
1 default active
100 12_NET_MGMT_VLAN active Gi1/2
Config2:
VLAN Name Status Ports
1 default active
88 100_MGMT-VLLAN active Gi8/1
100 12_Net_MGMT_VLAN active
If I hard-code the regex pattern like this, $regex_pattern = "^\d{1,3}\s+.*MGMT.*", I get the correct output as expected:
Config1 12_NET_MGMT_VLAN
Config2 100_MGMT-VLLAN
Config2 12_Net_MGMT_VLAN
Instead of hard-coding the regex pattern, I want to use the Read-Host cmdlet to ask a user to enter the word "MGMT", store it in a variable $Mgmt, and then concatenate it with a regex pattern to create a dynamic regex pattern, like this:
$Mgmt = Read-Host "Enter a word pattern to find a management vlan: "
For example, a user types in MGMT, and then I create a dynamic regex pattern as below:
$regex_pattern = "^\d{1,3}\s+.*"+$Mgmt+"_.*"
$regex_pattern = "^\d{1,3}\s+.*"+[regex]::escape($Mgmt)+".*"
Neither produced the correct results.
If anyone has a solution, please help. Thanks.
If we are to assume that VLAN names cannot contain spaces, you can use \S (non-space) as an anchor character. Using the subexpression operator $(), you can evaluate an expression within a string.
# Simulating a vlan config output
$Config = @'
VLAN Name Status Ports
1 default active
88 100_MGMT-VLLAN active Gi8/1
100 12_Net_MGMT_VLAN active
'@ -split '\r?\n'
# Using value MGMT here when prompted
$Mgmt = Read-Host "Enter a word pattern to find a management vlan"
$regex = "\S*$([regex]::Escape($Mgmt))\S*"
[regex]::Matches($Config,$regex).Value
Output:
100_MGMT-VLLAN
12_Net_MGMT_VLAN
Note that simple variable references like $Mgmt will expand properly within surrounding double quotes, e.g. "My VLAN is $Mgmt".
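For comparison, the same escape-then-interpolate idea in Python (a sketch, in case the audit logic is ever ported; re.escape plays the role of [regex]::Escape):
import re

config = """VLAN Name Status Ports
1 default active
88 100_MGMT-VLLAN active Gi8/1
100 12_Net_MGMT_VLAN active"""

mgmt = "MGMT"  # simulating the value a user would type at the prompt
regex = r"\S*" + re.escape(mgmt) + r"\S*"
print(re.findall(regex, config))  # ['100_MGMT-VLLAN', '12_Net_MGMT_VLAN']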
You could take this in a different direction and create a custom object from your output. This would enable you to use filtering via Where-Object and member access (.Property) to retrieve target data. This again assumes values don't contain spaces.
# Simulating a vlan config output
$Config = @'
VLAN Name Status Ports
1 default active
88 100_MGMT-VLLAN active Gi8/1
100 12_Net_MGMT_VLAN active
'@ -split '\r?\n'
$Mgmt = Read-Host "Enter VLAN Name"
# Replacing consecutive spaces with , first
$ConfigObjs = $Config -replace '\s+',',' | ConvertFrom-Csv
$ConfigObjs
Output:
VLAN Name Status Ports
---- ---- ------ -----
1 default active
88 100_MGMT-VLLAN active Gi8/1
100 12_Net_MGMT_VLAN active
Now you have properties that can be referenced and access to other comparison operators so you don't always need to use regex.
($ConfigObjs | Where Name -like "*$Mgmt*").Name
Output:
100_MGMT-VLLAN
12_Net_MGMT_VLAN

Regex for Twitter data for Hive

I have the following Twitter data.
The data is divided into two parts:
#Username
And the tweet or text:
RT #username: Stay behind, or take the jump (anything in text or tags and emoji)##name
#name
Jjjjjjjjj
Dhdkeueh
Sjdyeh
#kdudiwi
.....
RT #username: thehdydvekdgeke
Hshedhdkdjfnfjfkfmfmhdkalshsh+£) #&#(#(£63+kdjdj😆🙃☺🙃☺😇
RT #username: this sing kdudhekhh juygg jyttt hyyg
£jdhdieo+3-) £) 7--uuueoehrmwowyeheldyejelwyej
Djdyegeleisyhekelsudhejwksi
This is the data
I want to divide the data into two parts: the first is the username and the second is the tweet.
The regex I made is:
^(RT\s[^ ]*)\s([\W]*[\H]*[\w\s##;:!?+(+-_#)]*)$
The first part is working but the second part is not.
Can anyone help me?
with your_data as (
select 'RT #username: Stay behind, or take the jump (anything in text or tags and emoji)' as str
)
select regexp_extract(str,'^RT\\s(\\S*)\\s(.*)$',1) as username,
regexp_extract(str,'^RT\\s(\\S*)\\s(.*)$',2) as tweet
from your_data;
Result:
OK
username tweet
#username: Stay behind, or take the jump (anything in text or tags and emoji)
Time taken: 1.092 seconds, Fetched: 1 row(s)
Use '^RT\\s(\\S*):\\s(.*)$' if you do not want ':' in the username.
Or '^RT\\s(\\S*):?\\s(.*)$' if : is optional:
with your_data as (
select 'RT #username Stay behind, or take the jump (anything in text or tags and emoji)' as str
)
select regexp_extract(str,'^RT\\s(\\S*):?\\s(.*)$',1) as username,
regexp_extract(str,'^RT\\s(\\S*):?\\s(.*)$',2) as tweet
from your_data;
Result:
OK
username tweet
#username Stay behind, or take the jump (anything in text or tags and emoji)
Time taken: 28.587 seconds, Fetched: 1 row(s)
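As a quick check outside Hive (a sketch; note that the Hive string literals above double each backslash, while a Python raw string does not):
import re

line = "RT #username: Stay behind, or take the jump (anything in text or tags and emoji)"
m = re.match(r"^RT\s(\S*):\s(.*)$", line)
print(m.group(1))  # #username
print(m.group(2))  # Stay behind, or take the jump (anything in text or tags and emoji)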

AWS's own example of submitting a Pig job does not work due to an issue with piggybank.jar

I have been trying to test out submitting Pig jobs on AWS EMR following Amazon's guide. I made the change to the Pig script to ensure that it can find the piggybank.jar as instructed by Amazon. When I run the script I get an ERROR 1070 indicating that one of the functions available in piggybank cannot be resolved. Any ideas on what is going wrong?
Key part of the error:
2018-03-15 21:47:08,258 ERROR org.apache.pig.PigServer (main): exception during parsing: Error during parsing. Could not resolve org.apache.pig.piggybank.evaluation.string.EXTRACT using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Failed to parse: Pig script failed to parse: <file s3://cis442f-data/pigons3/do-reports4.pig, line 26, column 6> Failed to generate logical plan. Nested exception: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve org.apache.pig.piggybank.evaluation.string.EXTRACT using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
The first part of the script is as follows. Line 26 referred to in the error contains "EXTRACT(":
register file:/usr/lib/pig/lib/piggybank.jar;
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT;
DEFINE FORMAT org.apache.pig.piggybank.evaluation.string.FORMAT;
DEFINE REPLACE org.apache.pig.piggybank.evaluation.string.REPLACE;
DEFINE DATE_TIME org.apache.pig.piggybank.evaluation.datetime.DATE_TIME;
DEFINE FORMAT_DT org.apache.pig.piggybank.evaluation.datetime.FORMAT_DT;
--
-- import logs and break into tuples
--
raw_logs =
-- load the weblogs into a sequence of one element tuples
LOAD '$INPUT' USING TextLoader AS (line:chararray);
logs_base =
-- for each weblog string convert the weblong string into a
-- structure with named fields
FOREACH
raw_logs
GENERATE
FLATTEN (
EXTRACT(
line,
'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
)
)
AS (
remoteAddr: chararray, remoteLogname: chararray, user: chararray, time: chararray,
request: chararray, status: int, bytes_string: chararray, referrer: chararray,
browser: chararray
)
;
The correct function name is REGEX_EXTRACT. So either change your DEFINE statement to
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.REGEX_EXTRACT;
OR use REGEX_EXTRACT directly in your Pig script:
logs_base =
-- for each weblog string convert the weblong string into a
-- structure with named fields
FOREACH
raw_logs
GENERATE
FLATTEN (
REGEX_EXTRACT(
line,
'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
)
)
AS (
remoteAddr: chararray, remoteLogname: chararray, user: chararray, time: chararray,
request: chararray, status: int, bytes_string: chararray, referrer: chararray,
browser: chararray
)
;
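For reference, the pattern itself is the standard Apache combined-log-format regex. Here is a quick Python check against a made-up sample line (a sketch; note that Pig string literals double the backslashes, while a Python raw string does not):
import re

# Hypothetical sample line in Apache combined log format
line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
        '"http://www.example.com/start.html" "Mozilla/4.08"')

pattern = (r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] '
           r'"(.+?)" (\S+) (\S+) "([^"]*)" "([^"]*)"')

print(re.match(pattern, line).groups())
# ('127.0.0.1', '-', 'frank', '10/Oct/2000:13:55:36 -0700',
#  'GET /apache_pb.gif HTTP/1.0', '200', '2326',
#  'http://www.example.com/start.html', 'Mozilla/4.08')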
The original script from Amazon would not work because it relied on an older version of piggybank. Here is an updated version that does not need piggybank at all.
--
-- import logs and break into tuples
--
raw_logs =
-- load the weblogs into a sequence of one element tuples
LOAD '$INPUT' USING TextLoader AS (line:chararray);
logs_base =
-- for each weblog string convert the weblong string into a
-- structure with named fields
FOREACH
raw_logs
GENERATE
FLATTEN (
REGEX_EXTRACT_ALL(
line,
'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
)
)
AS (
remoteAddr: chararray, remoteLogname: chararray, user: chararray, time: chararray,
request: chararray, status: int, bytes_string: chararray, referrer: chararray,
browser: chararray
)
;
logs =
-- convert from string values to typed values such as date_time and integers
FOREACH
logs_base
GENERATE
*,
ToDate(time, 'dd/MMM/yyyy:HH:mm:ss Z', 'UTC') as dtime,
(int)REPLACE(bytes_string, '-', '0') as bytes
;
--
-- determine total number of requests and bytes served by UTC hour of day
-- aggregating as a typical day across the total time of the logs
--
by_hour_count =
-- group logs by their hour of day, counting the number of logs in that hour
-- and the sum of the bytes of rows for that hour
FOREACH
(GROUP logs BY GetHour(dtime))
GENERATE
$0,
COUNT($1) AS num_requests,
SUM($1.bytes) AS num_bytes
;
STORE by_hour_count INTO '$OUTPUT/total_requests_bytes_per_hour';
--
-- top 50 X.X.X.* blocks
--
by_ip_count =
-- group weblog entries by the ip address from the remote address field
-- and count the number of entries for each address blok as well as
-- the sum of the bytes
FOREACH
(GROUP logs BY (chararray)REGEX_EXTRACT(remoteAddr, '(\\d+\\.\\d+\\.\\d+)', 1))
-- (GROUP logs BY block)
GENERATE $0,
COUNT($1) AS num_requests,
SUM($1.bytes) AS num_bytes
;
by_ip_count_sorted = ORDER by_ip_count BY num_requests DESC;
by_ip_count_limited =
-- take the top 50 address blocks by number of requests
LIMIT by_ip_count_sorted 50;
STORE by_ip_count_limited into '$OUTPUT/top_50_ips';
--
-- top 50 external referrers
--
by_referrer_count =
-- group by the referrer URL and count the number of requests
FOREACH
(GROUP logs BY (chararray)REGEX_EXTRACT(referrer, '(http:\\/\\/[a-z0-9\\.-]+)', 1))
GENERATE
FLATTEN($0),
COUNT($1) AS num_requests
;
by_referrer_count_filtered =
-- exclude matches for example.org
FILTER by_referrer_count BY NOT $0 matches '.*example\\.org';
by_referrer_count_sorted =
-- order the results by number of requests
ORDER by_referrer_count_filtered BY $1 DESC;
by_referrer_count_limited =
-- take the top 50 results
LIMIT by_referrer_count_sorted 50;
STORE by_referrer_count_limited INTO '$OUTPUT/top_50_external_referrers';
--
-- top search terms coming from bing or google
--
google_and_bing_urls =
-- find referrer fields that match either bing or google
FILTER
(FOREACH logs GENERATE referrer)
BY
referrer matches '.*bing.*'
OR
referrer matches '.*google.*'
;
search_terms =
-- extract from each referrer url the search phrases
FOREACH
google_and_bing_urls
GENERATE
FLATTEN(REGEX_EXTRACT_ALL(referrer, '.*[&\\?]q=([^&]+).*')) as (term:chararray)
;
search_terms_filtered =
-- reject urls that contained no search terms
FILTER search_terms BY NOT $0 IS NULL;
search_terms_count =
-- for each search phrase count the number of weblogs entries that contained it
FOREACH
(GROUP search_terms_filtered BY $0)
GENERATE
$0,
COUNT($1) AS num
;
search_terms_count_sorted =
-- order the results
ORDER search_terms_count BY num DESC;
search_terms_count_limited =
-- take the top 50 results
LIMIT search_terms_count_sorted 50;
STORE search_terms_count_limited INTO '$OUTPUT/top_50_search_terms_from_bing_google';

String split with spaces using regex

I am trying to split a string using regex. I need to use regex in NiFi to split a string into groups. Could anyone help me split the below string using regex?
Alternatively, how can I give a specific occurrence number of the delimiter to split the string on? For example, in the string below, how can I specify that I want the string after the 3rd occurrence of a space?
Suppose I have a string:
"6/19/2017 12:14:07 PM 0FA0 PACKET 0000000DF5EC3D80 UDP Snd 11.222.333.44 93c8 R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)"
And I want a result something like this:
group 1 - 6/19/2017 12:14:07 PM
group 2 - 0FA0
group 3 - PACKET 0000000DF5EC3D80
group 4 - UDP
group 5 - Snd
group 6 - 11.222.333.44
group 7 - 93c8
group 8 - R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)
Could anyone help me? Thanks in advance.
If it's really just certain spaces you want as delimiters, you can do something like this to avoid a fixed-width nightmare:
regex = "(\S+\s\S+\s\S+)\s(\S+)\s(\S+\s\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(.*)"
Pretty much it's what it looks like: groups of non-spaces (\S+) separated by spaces (\s), with each group wrapped in parentheses. The .* at the end is just the rest of the line; it can be adjusted as needed. If you wanted a group for every non-spaced token you could do a split instead of a regex, but it looks like that isn't what is desired. I don't have access to NiFi to test, but here is an example in Python.
import re
text = "6/19/2017 12:14:07 PM 0FA0 PACKET 0000000DF5EC3D80 UDP Snd 11.222.333.44 93c8 R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)"
regex = "(\S+\s\S+\s\S+)\s(\S+)\s(\S+\s\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(.*)"
match = re.search(regex, text)
print ("group 1 - " + match.group(1))
print ("group 2 - " + match.group(2))
print ("group 3 - " + match.group(3))
print ("group 4 - " + match.group(4))
print ("group 5 - " + match.group(5))
print ("group 6 - " + match.group(6))
print ("group 7 - " + match.group(7))
print ("group 8 - " + match.group(8))
Output:
group 1 - 6/19/2017 12:14:07 PM
group 2 - 0FA0
group 3 - PACKET 0000000DF5EC3D80
group 4 - UDP
group 5 - Snd
group 6 - 11.222.333.44
group 7 - 93c8
group 8 - R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)
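If what you actually want is simply everything after the 3rd occurrence of a space (as asked above), a plain split with a maximum split count covers that without any regex:
text = ("6/19/2017 12:14:07 PM 0FA0 PACKET 0000000DF5EC3D80 UDP Snd "
        "11.222.333.44 93c8 R Q [8085 A DR NOERROR] PTR "
        "(2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)")

parts = text.split(" ", 3)  # split at most 3 times
print(parts[3])  # everything after the 3rd space: 0FA0 PACKET 0000000DF5EC3D80 UDP Snd ...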
Are you trying to extract every group into a separate attribute? This is certainly possible in "pure" NiFi, but with lines this long, it may make more sense to use the ExecuteScript processor, taking advantage of Groovy's or Python's more capable regular-expression handling in conjunction with String#split(), and provide a script like sniperd posted.
To perform this task using ExtractText, you'll configure it as follows:
Copyable patterns:
group 1: (^\S+\s\S+\s\S+)
group 2: (?i)(?<=\s)([a-f0-9]{4})(?=\s)
group 3: (?i)(?<=\s)(PACKET\s[a-f0-9]{4,16})(?=\s)
group 4: (?i)(?<=\s\S{16}\s)([\w]{3,})(?=\s)
group 5: (?i)(?<=\s.{3}\s)([\w]{3,})(?=\s)
group 6: (?i)(?<=\s.{3}\s)([\d\.]{7,15})(?=\s)
group 7: (?i)(?<=\d\s)([a-f0-9]{4})(?=\s)
group 8: (?i)(?<=\d\s[a-f0-9]{4}\s)(.*)$
It is important to note that Include Capture Group 0 is set to false. You will get duplicate groups (group 1 and group 1.1) due to the way regex expressions are validated in NiFi (currently all regexes must have at least one capture group -- this will be fixed with NIFI-4095 | ExtractText should not require a capture group in every regular expression).
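Before wiring these in, you can sanity-check each pattern against the sample line in Python (a sketch; the constructs used here, fixed-width lookbehinds and inline (?i) flags, behave the same in Java's and Python's regex engines):
import re

line = ("6/19/2017 12:14:07 PM 0FA0 PACKET 0000000DF5EC3D80 UDP Snd "
        "11.222.333.44 93c8 R Q [8085 A DR NOERROR] PTR "
        "(2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)")

patterns = [
    r"(^\S+\s\S+\s\S+)",
    r"(?i)(?<=\s)([a-f0-9]{4})(?=\s)",
    r"(?i)(?<=\s)(PACKET\s[a-f0-9]{4,16})(?=\s)",
    r"(?i)(?<=\s\S{16}\s)([\w]{3,})(?=\s)",
    r"(?i)(?<=\s.{3}\s)([\w]{3,})(?=\s)",
    r"(?i)(?<=\s.{3}\s)([\d\.]{7,15})(?=\s)",
    r"(?i)(?<=\d\s)([a-f0-9]{4})(?=\s)",
    r"(?i)(?<=\d\s[a-f0-9]{4}\s)(.*)$",
]

for i, p in enumerate(patterns, start=1):
    m = re.search(p, line)
    print("group %d - %s" % (i, m.group(1) if m else None))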
The resulting flowfile has the attributes properly populated:
Full log output:
2017-06-20 14:45:57,050 INFO [Timer-Driven Process Thread-9] o.a.n.processors.standard.LogAttribute LogAttribute[id=c6b04310-015c-1000-b21e-c64aec5b035e] logging for flow file StandardFlowFileRecord[uuid=5209cc65-08fe-44a4-be96-9f9f58ed2490,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1497984255809-1, container=default, section=1], offset=444, length=148],offset=0,name=1920315756631364,size=148]
--------------------------------------------------
Standard FlowFile Attributes
Key: 'entryDate'
Value: 'Tue Jun 20 14:45:10 EDT 2017'
Key: 'lineageStartDate'
Value: 'Tue Jun 20 14:45:10 EDT 2017'
Key: 'fileSize'
Value: '148'
FlowFile Attribute Map Content
Key: 'filename'
Value: '1920315756631364'
Key: 'group 1'
Value: '6/19/2017 12:14:07 PM'
Key: 'group 1.1'
Value: '6/19/2017 12:14:07 PM'
Key: 'group 2'
Value: '0FA0'
Key: 'group 2.1'
Value: '0FA0'
Key: 'group 3'
Value: 'PACKET 0000000DF5EC3D80'
Key: 'group 3.1'
Value: 'PACKET 0000000DF5EC3D80'
Key: 'group 4'
Value: 'UDP'
Key: 'group 4.1'
Value: 'UDP'
Key: 'group 5'
Value: 'Snd'
Key: 'group 5.1'
Value: 'Snd'
Key: 'group 6'
Value: '11.222.333.44'
Key: 'group 6.1'
Value: '11.222.333.44'
Key: 'group 7'
Value: '93c8'
Key: 'group 7.1'
Value: '93c8'
Key: 'group 8'
Value: 'R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)'
Key: 'group 8.1'
Value: 'R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)'
Key: 'path'
Value: './'
Key: 'uuid'
Value: '5209cc65-08fe-44a4-be96-9f9f58ed2490'
--------------------------------------------------
6/19/2017 12:14:07 PM 0FA0 PACKET 0000000DF5EC3D80 UDP Snd 11.222.333.44 93c8 R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)
Another option with the release of NiFi 1.3.0 is to use the record processing capabilities. This is a new feature which allows arbitrary input formats (Avro, JSON, CSV, etc.) to be parsed and manipulated in a streaming manner. Mark Payne has written a very good tutorial here that introduces the feature and provides some simple walkthroughs.

How to match a group of lines that match a pattern

I am trying to filter out a group of lines that match a pattern using a regexp but am having trouble getting the correct regexp to use.
The text file contains lines like this:
transaction 390134; promote; 2016/12/20 01:17:07 ; user: build
to: DEVELOPMENT ; from: DEVELOPMENT_BUILD
# some commit comment
/./som/file/path 11745/409 (22269/257)
# merged
version 22269/257 (22269/257)
ancestor: (22133/182)
transaction 390136; promote; 2016/12/20 01:17:08 ; user: najmi
to: DEVELOPMENT ; from: DEVELOPMENT_BUILD
/./some/other/file/path 11745/1 (22269/1)
version 22269/1 (22269/1)
ancestor: (none - initial version)
type: dir
I would like to filter out the lines that start with "transaction" and contain "user: build", all the way until the next line that starts with "transaction".
The idea is to end up with transaction lines where user is not "build".
Thanks for any help.
If you want only the transaction lines for all users except build:
grep '^transaction ' test_data| grep -v 'user: build$'
If you want the whole transaction record for such users:
awk '/^transaction /{ p = !/user: build$/};p' test_data
OR
perl -lne 'if(/^transaction /){$p = !/user: build$/}; print if $p' test_data
The -A and -v options of the grep command would have done the trick if all transaction records had the same number of lines.
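For reference, the same toggle logic as the awk/perl one-liners, written out in Python (a sketch; assumes the records are in test_data):
# Print each transaction record unless its header line ends with "user: build"
printing = False
with open("test_data") as fh:
    for line in fh:
        if line.startswith("transaction "):
            printing = not line.rstrip().endswith("user: build")
        if printing:
            print(line, end="")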