I am trying to split a string using regex in NiFi; I need to break the string below into groups. Could anyone help me split it with a regex?
Alternatively, how can I specify a particular occurrence of a delimiter to split on? For example, in the string below, how would I ask for everything after the 3rd occurrence of a space?
Suppose I have a string:
"6/19/2017 12:14:07 PM 0FA0 PACKET 0000000DF5EC3D80 UDP Snd 11.222.333.44 93c8 R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)"
And I want a result something like this:
group 1 - 6/19/2017 12:14:07 PM
group 2 - 0FA0
group 3 - PACKET 0000000DF5EC3D80
group 4 - UDP
group 5 - Snd
group 6 - 11.222.333.44
group 7 - 93c8
group 8 - R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)
Could anyone help me? Thanks in advance.
If it's really just certain spaces you want as delimiters, you can do something like this to avoid a fixed-width nightmare:
regex = r"(\S+\s\S+\s\S+)\s(\S+)\s(\S+\s\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(.*)"
It's pretty much what it looks like: runs of non-space characters (\S+) separated by whitespace (\s), with each run of interest wrapped in parentheses to form a group. The .* at the end is just the rest of the line; it can be adjusted as needed. If you wanted every whitespace-separated token to be its own group, you could do a split instead of a regex, but that doesn't look like what is desired. I don't have access to NiFi to test, but here is an example in Python.
import re

text = "6/19/2017 12:14:07 PM 0FA0 PACKET 0000000DF5EC3D80 UDP Snd 11.222.333.44 93c8 R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)"
# Raw string so the backslashes reach the regex engine untouched
regex = r"(\S+\s\S+\s\S+)\s(\S+)\s(\S+\s\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(.*)"
match = re.search(regex, text)
for i in range(1, 9):
    print("group %d - %s" % (i, match.group(i)))
Output:
group 1 - 6/19/2017 12:14:07 PM
group 2 - 0FA0
group 3 - PACKET 0000000DF5EC3D80
group 4 - UDP
group 5 - Snd
group 6 - 11.222.333.44
group 7 - 93c8
group 8 - R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)
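The question also asks how to take everything after the 3rd occurrence of a space; for that, a plain split with a maxsplit is enough and no regex is needed. A minimal sketch, reusing the text variable from above:
# Split on at most 3 spaces; element 3 is everything after the 3rd space
rest = text.split(" ", 3)[3]
print(rest)  # 0FA0 PACKET 0000000DF5EC3D80 UDP Snd ...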
Are you trying to extract every group into a separate attribute? This is certainly possible in "pure" NiFi, but with lines this long, it may make more sense to use the ExecuteScript processor, taking advantage of Groovy's or Python's more flexible regular-expression handling in conjunction with String#split(), and provide a script like the one sniperd posted.
To perform this task using ExtractText, you'll configure it as follows:
Copyable patterns:
group 1: (^\S+\s\S+\s\S+)
group 2: (?i)(?<=\s)([a-f0-9]{4})(?=\s)
group 3: (?i)(?<=\s)(PACKET\s[a-f0-9]{4,16})(?=\s)
group 4: (?i)(?<=\s\S{16}\s)([\w]{3,})(?=\s)
group 5: (?i)(?<=\s.{3}\s)([\w]{3,})(?=\s)
group 6: (?i)(?<=\s.{3}\s)([\d\.]{7,15})(?=\s)
group 7: (?i)(?<=\d\s)([a-f0-9]{4})(?=\s)
group 8: (?i)(?<=\d\s[a-f0-9]{4}\s)(.*)$
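NiFi's ExtractText uses Java's regex engine, but all of the lookbehinds above are fixed-width, so Python's re can compile them as well, which makes a quick offline sanity check possible before pasting the patterns into the processor. A sketch, reusing the sample line from the question as text:
import re

text = "6/19/2017 12:14:07 PM 0FA0 PACKET 0000000DF5EC3D80 UDP Snd 11.222.333.44 93c8 R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)"
patterns = [
    r"(^\S+\s\S+\s\S+)",
    r"(?i)(?<=\s)([a-f0-9]{4})(?=\s)",
    r"(?i)(?<=\s)(PACKET\s[a-f0-9]{4,16})(?=\s)",
    r"(?i)(?<=\s\S{16}\s)([\w]{3,})(?=\s)",
    r"(?i)(?<=\s.{3}\s)([\w]{3,})(?=\s)",
    r"(?i)(?<=\s.{3}\s)([\d\.]{7,15})(?=\s)",
    r"(?i)(?<=\d\s)([a-f0-9]{4})(?=\s)",
    r"(?i)(?<=\d\s[a-f0-9]{4}\s)(.*)$",
]
# Each pattern should print the same value NiFi extracts for that group
for i, p in enumerate(patterns, start=1):
    m = re.search(p, text)
    print("group %d - %s" % (i, m.group(1) if m else "no match"))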
It is important to note that Include Capture Group 0 is set to false. You will still get duplicate attributes (group 1 and group 1.1) due to the way regular expressions are validated in NiFi: currently every regex must have at least one capture group. This will be fixed by NIFI-4095 | ExtractText should not require a capture group in every regular expression.
The resulting flowfile has the attributes properly populated:
Full log output:
2017-06-20 14:45:57,050 INFO [Timer-Driven Process Thread-9] o.a.n.processors.standard.LogAttribute LogAttribute[id=c6b04310-015c-1000-b21e-c64aec5b035e] logging for flow file StandardFlowFileRecord[uuid=5209cc65-08fe-44a4-be96-9f9f58ed2490,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1497984255809-1, container=default, section=1], offset=444, length=148],offset=0,name=1920315756631364,size=148]
--------------------------------------------------
Standard FlowFile Attributes
Key: 'entryDate'
Value: 'Tue Jun 20 14:45:10 EDT 2017'
Key: 'lineageStartDate'
Value: 'Tue Jun 20 14:45:10 EDT 2017'
Key: 'fileSize'
Value: '148'
FlowFile Attribute Map Content
Key: 'filename'
Value: '1920315756631364'
Key: 'group 1'
Value: '6/19/2017 12:14:07 PM'
Key: 'group 1.1'
Value: '6/19/2017 12:14:07 PM'
Key: 'group 2'
Value: '0FA0'
Key: 'group 2.1'
Value: '0FA0'
Key: 'group 3'
Value: 'PACKET 0000000DF5EC3D80'
Key: 'group 3.1'
Value: 'PACKET 0000000DF5EC3D80'
Key: 'group 4'
Value: 'UDP'
Key: 'group 4.1'
Value: 'UDP'
Key: 'group 5'
Value: 'Snd'
Key: 'group 5.1'
Value: 'Snd'
Key: 'group 6'
Value: '11.222.333.44'
Key: 'group 6.1'
Value: '11.222.333.44'
Key: 'group 7'
Value: '93c8'
Key: 'group 7.1'
Value: '93c8'
Key: 'group 8'
Value: 'R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)'
Key: 'group 8.1'
Value: 'R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)'
Key: 'path'
Value: './'
Key: 'uuid'
Value: '5209cc65-08fe-44a4-be96-9f9f58ed2490'
--------------------------------------------------
6/19/2017 12:14:07 PM 0FA0 PACKET 0000000DF5EC3D80 UDP Snd 11.222.333.44 93c8 R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)
Another option with the release of NiFi 1.3.0 is to use the record processing capabilities. This is a new feature which allows arbitrary input formats (Avro, JSON, CSV, etc.) to be parsed and manipulated in a streaming manner. Mark Payne has written a very good tutorial here that introduces the feature and provides some simple walkthroughs.
Related
I want to extract the timestamp along with the UserID from the line below, and group on it:
2020-10-12 12:30:22.540 INFO 1 --- [enerContainer-4] c.t.t.o.s.s.UserPrepaidService : Validating the user with UserID:1111 systemID:sys111
It has to be pulled out of the whole log below:
2020-10-12 12:30:22.538 INFO 1 --- [ener-4] c.t.t.o.s.service.UserService : AccountDetails":[{"snumber":"2222","sdetails":[{"sId":"0474889018","sType":"Java","plan":[{"snumber":"sdds22"}]}]}]}
2020-10-12 12:30:22.538 INFO 1 --- [ener-4] c.t.t.o.s.service.ReceiverService : Received userType is:Normal
2020-10-12 12:30:22.540 INFO 1 --- [enerContainer-4] c.t.t.o.s.s.UserPrepaidService : Validating the user with UserID:1111 systemID:sys111
2020-10-12 12:30:22.540 INFO 1 --- [enerContainer-4] c.t.t.o.s.util.CommonUtil : The Code is valid for userId: 1111 systemId: sys111
2020-10-12 12:30:22.577 INFO 1 --- [enerContainer-4] c.t.t.o.s.r.Dao : Saving user into dB ..... with User-ID:1111
....
(the same line repeats)
Below is my SPL search; it returns only the userId, grouped, from that specific line.
But I want the timestamp from that line as well, so I can group by it in a timechart.
index="tis" logGroup="/ecs/logsmy" "logEvents{}.message"="*Validating the user with UserID*" | spath output=myfield path=logEvents{}.message | rex field=myfield "(?<=Validating the user with UserID:)(?<userId>[0-9]+)(?= systemID:)" | table userId | dedup userId | stats count values(userId) by userId
Basically I tried the below:
(^(?<dtime>\d{4}-\d{1,2}-\d{1,2}\s+\d{1,2}:\d{1,2}:\d{1,2}\.\d+) )(?<=Validating the user with UserID:)(?<userId>[0-9]+)(?= systemID:)
but it gave me all the timestamps, not just the one from the line I mentioned above.
You placed the lookaround right after matching the timestamp pattern, but you first have to move to the position where the lookbehind is true.
If you want both values, you can match Validating the user with UserID: and systemID: instead of using a lookaround.
If there are leading whitespace characters, you could match them with \s* or [^\S\r\n]*
^\s*(?<dtime>\d{4}-\d{1,2}-\d{1,2}\s+\d{1,2}:\d{1,2}:\d{1,2}\.\d+).*\bValidating the user with UserID:(?<userId>[0-9]+) systemID:
Regex demo
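If you want to check the combined pattern outside Splunk, Python's re handles it the same way; note that Python spells named groups (?P<name>...) rather than (?<name>...). A small sketch over two of the sample lines:
import re

log = """2020-10-12 12:30:22.538 INFO 1 --- [ener-4] c.t.t.o.s.service.ReceiverService : Received userType is:Normal
2020-10-12 12:30:22.540 INFO 1 --- [enerContainer-4] c.t.t.o.s.s.UserPrepaidService : Validating the user with UserID:1111 systemID:sys111"""

pattern = re.compile(
    r"^\s*(?P<dtime>\d{4}-\d{1,2}-\d{1,2}\s+\d{1,2}:\d{1,2}:\d{1,2}\.\d+)"
    r".*\bValidating the user with UserID:(?P<userId>[0-9]+) systemID:",
    re.MULTILINE,
)
# Only the "Validating" line matches, so only its timestamp is returned
for m in pattern.finditer(log):
    print(m.group("dtime"), m.group("userId"))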
I have a table with an Equipment column containing strings. I want to split each string, take a part of it, and add that part to a new column (SerialNumber_Asset). The part I want to extract always has the same pattern: A + 7 digits. Example:
Equipment SerialNumber_Asset
1 AXION 920 - A2302888 - BG-ADM-82 -NK A2302888
2 Case IH Puma T4B 220 - BG-AEH-87 - NK null
3 ARION 650 - A7702047 - BG-ADZ-74 - MU A7702047
4 ARION 650 - A7702039 - BG-ADZ-72 - NK A7702039
My code:
select x, y, z,
regexp_extract(Equipment, r'([\A][\d]{7})') as SerialNumber_Asset
FROM `aa.bb.cc`
The message I got:
Cannot parse regular expression: invalid escape sequence: \A
Any suggestions on what could be wrong? Thanks!
Just use A instead of [\A]: the letter A is a literal and needs no escaping, while \A inside a character class is parsed as an (invalid) escape sequence. Check the example below:
select regexp_extract('AXION 920 - A2302888 - BG-ADM-82 -NK', r'(A[\d]{7})') as SerialNumber_Asset
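The fix is easy to sanity-check outside BigQuery too; for a simple pattern like this, Python's re behaves the same as BigQuery's RE2. A sketch against the sample rows (expecting a serial number for rows 1 and 3 and None for row 2, matching the null in the question):
import re

rows = [
    "AXION 920 - A2302888 - BG-ADM-82 -NK",
    "Case IH Puma T4B 220 - BG-AEH-87 - NK",
    "ARION 650 - A7702047 - BG-ADZ-74 - MU",
]
for equipment in rows:
    m = re.search(r"A\d{7}", equipment)
    print(m.group(0) if m else None)  # A2302888, None, A7702047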
Trying to pull some logs and break them down. The following regex gives me a correct match for all 4 IPs: ([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}+), but I'm not sure how to either delete the rest of the data or "extract" the IPs. I only need the IPs, as shown below.
June 3rd 2020, 21:18:02.193 [2020-06-03T21:18:02.781503+00:00,192.168.5.134,0,172.16.139.61,514,rslog1,imtcp,]<183>Jun 3 21:18:02 005-1attt01 atas_ssl: 1591219073.296175 CAspjq31LV8F0b 146.233.244.131 38530 104.16.148.244 443 - - - www.yahoo.com F - - F - - - - - - -
June 3rd 2020, 21:18:02.193 [2020-06-03T21:18:02.781503+00:00,192.168.5.134,0,172.16.139.61,514,rslog1,imtcp,]<183>Jun 3 21:18:02 005-1attt01 atas_ssl: 1591219073.296175 CAspjq31LV8F0b 146.233.244.131 38530 104.16.148.244 443 - - - www.yahoo.com F - - F - - - - - - -
Need this:
192.168.5.134 172.16.139.61 146.233.244.131 104.16.148.244
192.168.5.134 172.16.139.61 146.233.244.131 104.16.148.244
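One way to get exactly that output is to findall the IPv4-shaped tokens on each line and join them with spaces. A minimal Python sketch, using a tidied version of the pattern (the stray + after {1,3} is unnecessary); the file name logs.txt is a hypothetical stand-in for wherever the lines come from:
import re

ip_pattern = re.compile(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b")
with open("logs.txt") as fh:  # hypothetical input file
    for line in fh:
        # Prints e.g. 192.168.5.134 172.16.139.61 146.233.244.131 104.16.148.244
        print(" ".join(ip_pattern.findall(line)))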
I have the following insert statements:
insert into temp1 values (test1, test2)
insert into temp2 values (test3)
Expected results:
insert into temp1 values (100, 200)
insert into temp2 values (300)
Essentially, I want to replace the literals test1 and test2 in the first query with the values 100 and 200 respectively, and test3 in the second query with 300. Can someone help with the mapping file for this use case?
I tried the following, but it doesn't have any effect:
Search Value (RegEx) Replacement values
(1)(.*values.*)(.*test1)(.*,)(.*test2) -> $2 val1 $4 val2
(2)(.*values.*)(.*test1) -> $2 val3
If this is literally the extent of the mapping you need to perform, a regular ReplaceText processor is enough. Using the settings below results in the desired output:
It simply detects every instance of test followed by a single digit and replaces it with that digit followed by 00.
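That replacement logic is easy to verify outside NiFi; a minimal Python sketch of the same idea, keeping the captured digit and appending 00:
import re

sql = "insert into temp1 values (test1, test2)\ninsert into temp2 values (test3)"
# test1 -> 100, test2 -> 200, test3 -> 300
print(re.sub(r"test(\d)", r"\g<1>00", sql))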
If you need to use ReplaceTextWithMapping for more complex lookups, the mapping file must be of the format:
search_value_1 replacement_value_1
search_value_2 replacement_value_2
etc.
The delimiter between the search and replacement values is \t.
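For the literals in this question, the mapping file would look like this (one search value and one replacement value per line, separated by a tab character):
test1	100
test2	200
test3	300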
--------------------------------------------------
Standard FlowFile Attributes
Key: 'entryDate'
Value: 'Wed Dec 07 10:48:24 PST 2016'
Key: 'lineageStartDate'
Value: 'Wed Dec 07 10:48:24 PST 2016'
Key: 'fileSize'
Value: '66'
FlowFile Attribute Map Content
Key: 'filename'
Value: '56196144045589'
Key: 'path'
Value: './'
Key: 'uuid'
Value: 'f6b28eb0-73b5-4d94-86c2-b7a5d4cc991e'
--------------------------------------------------
insert into temp1 values (100, 200)
insert into temp2 values (300)
I have a list as follows:
Policy Name: PTCC-VNX7500-server_4A
Options: 0x0
template: FALSE
Schedule: MonthlyFull
Type: FULL (0)
Calendar sched: Enabled
Allowed to retry after run day
Last day of month
Maximum MPX: 1
Synthetic: 0
Retention Level: 11 (3 years)
for which I need to extract "Schedule:"
(i.e. Schedule: MonthlyFull)
... and then "Retention Level:"
(i.e. Retention Level: 11 (3 years))
... wherever this string ("Retention Level:") shows up below the word "Schedule:".
I want to wind up with something that looks like this:
PTCC-VNX7500-server_4A,MonthlyFull,11 (3 years)
PTCC-VNX7500-server_4A,WeeklyFull,8 (4 weeks)
PTCC-VNX7500-server_4A,7_Year,1 (7 years)
I've tried to find the solution here and on PerlMonks but haven't been successful.
Thanks!
You don't specify this, but assuming each schedule is under a new policy name, you can use this regex:
Policy Name:\s*([^\n]+).*?Schedule:\s*([^\n]+).*?Retention Level:\s*([^\n]+)
This checks for Policy Name: followed by zero or more whitespace characters, then captures everything up to the newline. Next it lazily matches any number of characters until Schedule:, skips any whitespace, and again captures everything up to the newline. Finally, it does the same once more for Retention Level:, capturing up to the newline.
As seen in the linked example, this gives you 3 groups containing the policy name, schedule, and retention level. You will need the global modifier (g) to match more than one policy at a time, the dot-matches-newline modifier (s) so that .*? can cross line breaks, and optionally the case-insensitive modifier (i).
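In Python terms, g corresponds to using re.finditer, s to re.DOTALL, and i to re.IGNORECASE; a quick sketch over a trimmed version of the data:
import re

data = """Policy Name: PTCC-VNX7500-server_4A
Schedule: MonthlyFull
Retention Level: 11 (3 years)
Policy Name: PTCC-VNX7500-server_4A
Schedule: WeeklyFull
Retention Level: 8 (4 weeks)"""

pattern = re.compile(
    r"Policy Name:\s*([^\n]+).*?Schedule:\s*([^\n]+).*?Retention Level:\s*([^\n]+)",
    re.DOTALL | re.IGNORECASE,
)
for m in pattern.finditer(data):
    print(",".join(m.groups()))
# PTCC-VNX7500-server_4A,MonthlyFull,11 (3 years)
# PTCC-VNX7500-server_4A,WeeklyFull,8 (4 weeks)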
If there are multiple schedules under one policy name, you can use this regex:
(?:Policy Name:\s*([^\n]+).*?)?Schedule:\s*([^\n]+).*?Retention Level:\s*([^\n]+)
This is very similar; we just wrap the whole Policy Name:\s*([^\n]+).*? section in a non-capturing group and make it optional, meaning it does not need to be matched. The first match will have all 3 capture groups filled (1: policy, 2: schedule, 3: retention), while subsequent matches may have only 2 filled (1: null, 2: schedule, 3: retention). You would then use your language of choice to carry forward that match's policy name from the previous match.
Here is one way of doing it:
use strict;
use warnings;

my @rec;
while (my $line = <DATA>) {
    if ($line =~ /Policy Name:|Schedule:|Retention Level:/) {
        chomp($line);
        my ($name, $value) = split /:\s*/, $line;
        push @rec, $value;
        if ($line =~ /Retention Level/) {
            local $" = ",";    # list separator used when interpolating @rec
            print "@rec\n";
            @rec = ();
        }
    }
}
__DATA__
Policy Name: PTCC-VNX7500-server_4A
Options: 0x0
template: FALSE
Schedule: MonthlyFull
Type: FULL (0)
Calendar sched: Enabled
Allowed to retry after run day
Last day of month
Maximum MPX: 1
Synthetic: 0
Retention Level: 11 (3 years)
Policy Name: PTCC-VNX7500-server_4A
Options: 0x0
template: FALSE
Schedule: WeeklyFull
Type: FULL (0)
Calendar sched: Enabled
Allowed to retry after run day
Last day of month
Maximum MPX: 1
Synthetic: 0
Retention Level: 8 (4 weeks)
Output:
PTCC-VNX7500-server_4A,MonthlyFull,11 (3 years)
PTCC-VNX7500-server_4A,WeeklyFull,8 (4 weeks)
use strict;
use warnings;
use autodie;

my %record;
while (<DATA>) {
    if (/^\s*(.*?):\s*(.*)/) {
        my ($k, $v) = ($1, $2);
        # A new "Policy Name" starts the next record; flush the previous one
        if ($k eq 'Policy Name' && %record) {
            print join(',', @record{('Policy Name', 'Schedule', 'Retention Level')}), "\n";
            %record = ();
        }
        $record{$k} = $v;
    }
}
print join(',', @record{('Policy Name', 'Schedule', 'Retention Level')}), "\n";
__DATA__
Policy Name: PTCC-VNX7500-server_4A
Options: 0x0
template: FALSE
Schedule: MonthlyFull
Type: FULL (0)
Calendar sched: Enabled
Allowed to retry after run day
Last day of month
Maximum MPX: 1
Synthetic: 0
Retention Level: 11 (3 years)
Policy Name: PTCC-VNX7500-server_123
Options: 0x0
template: FALSE
Schedule: SometimesEmpty
Type: FULL (0)
Calendar sched: Enabled
Allowed to retry after run day
Last day of month
Maximum MPX: 1
Synthetic: 0
Retention Level: 41 (8 years)
Policy Name: PTCC-VNX7500-server_789
Options: 0x0
template: FALSE
Schedule: AlwaysBusy
Type: FULL (0)
Calendar sched: Enabled
Allowed to retry after run day
Last day of month
Maximum MPX: 1
Synthetic: 0
Retention Level: 17 (2 years)
Outputs:
PTCC-VNX7500-server_4A,MonthlyFull,11 (3 years)
PTCC-VNX7500-server_123,SometimesEmpty,41 (8 years)
PTCC-VNX7500-server_789,AlwaysBusy,17 (2 years)