Capture repeating group with RegEx - regex

I am trying to parse an input line looking like this:
AC#10,N850FD,10%,WEEK,IFR,1/22:45,2/00:58,390,F,0743,KEWR,3/02:30,3/05:04,380,F,1202,KMEM,3/11:15,3/20:04,350,F,0038,LFPG,4/04:00,4/15:35,330,F,5342,ZGGG,4/19:05,4/22:50,370,F,5608,RJAA,5/13:25,5/14:45,300,F,0060,RJBB,5/18:05,6/06:35,330,F,0060,KMEM,6/20:45,0/05:42,340,F,0948,PHNL,0/07:21,0/12:24,370,F,0802,KLAX,0/14:49,0/18:09,370,F,0806,KMEM
The first 5 "fields" are the "header" ("AC#10,N850FD,10%,WEEK,IFR"), and the rest is are repeating groups of 6 "fields" (e.g. "1/22:45,2/00:58,390,F,0743,KEWR").
I'm a RegEx newbie, but to do this I have come up with the following RegEx statement: (AC#)(\d+),([a-zA-Z0-9]+),(\d+%),(WEEK|DAY),(IFR|VFR)(,\d\/\d{2}:\d{2},\d\/\d{2}:\d{2},\d+,[FR],\d+,[A-Z0-9]{3,5})+.
The result of the first many groups (each "field" in the "header") are extracted fine, and I can easily access each value (group). However my problem is the following/repeating groups. Only the last of the repeating "groups" are extracted. If I remove the very last "+" only the first of the repeating "groups" are extracted (naturally).
Example here: https://regex101.com/r/HsQMge/1
Here is the result I hope to get (as groups):
AC#
10
N850FD
10%
WEEK
IFR
,1/22:45,2/00:58,390,F,0743,KEWR
,3/02:30,3/05:04,380,F,1202,KMEM
,3/11:15,3/20:04,350,F,0038,LFPG
,4/04:00,4/15:35,330,F,5342,ZGGG
,4/19:05,4/22:50,370,F,5608,RJAA
,5/13:25,5/14:45,300,F,0060,RJBB
,5/18:05,6/06:35,330,F,0060,KMEM
,6/20:45,0/05:42,340,F,0948,PHNL
,0/07:21,0/12:24,370,F,0802,KLAX
,0/14:49,0/18:09,370,F,0806,KMEM

Probably RegEx is not the right tool to do this task. Maybe you can use it just for splitting string into array. Rest job is for array_chunk :
$str = "AC#10,N850FD,10%,WEEK,IFR,1/22:45,2/00:58,390,F,0743,KEWR,3/02:30,3/05:04,380,F,1202,KMEM,3/11:15,3/20:04,350,F,0038,LFPG,4/04:00,4/15:35,330,F,5342,ZGGG,4/19:05,4/22:50,370,F,5608,RJAA,5/13:25,5/14:45,300,F,0060,RJBB,5/18:05,6/06:35,330,F,0060,KMEM,6/20:45,0/05:42,340,F,0948,PHNL,0/07:21,0/12:24,370,F,0802,KLAX,0/14:49,0/18:09,370,F,0806,KMEM";
$data = preg_split('/[,#]/',$str);
$data = array_chunk($data, 6);
var_dump($data);
Try it online!

I can't get it to work with one regular expression (still think it should be possible), however I got it working in two passes. First I use the following RegEx, to split the individual fields of the "header" into groups, and then grab the rest of the input line as the last group (using "(.*)" after the last comma):
(AC#)(\d+),([a-zA-Z0-9]+),(\d+%),(WEEK|DAY),(IFR|VFR),(.*)
This leaves me with the rest of the information in one single group ("1/22:45,2/00:58,390,F,0743,KEWR,3/02:30,3/05:04,380,F,1202,KMEM,3/11:15,3/20:04,350,F,0038,LFPG,4/04:00,4/15:35,330,F,5342,ZGGG,4/19:05,4/22:50,370,F,5608,RJAA,5/13:25,5/14:45,300,F,0060,RJBB,5/18:05,6/06:35,330,F,0060,KMEM,6/20:45,0/05:42,340,F,0948,PHNL,0/07:21,0/12:24,370,F,0802,KLAX,0/14:49,0/18:09,370,F,0806,KMEM"). I then parse this group with another regular expression, that groups the repeating sections (without a problem - now there is no longer a "header"):
(\d\/\d{2}:\d{2},\d\/\d{2}:\d{2},\d+,[FR],\d+,[A-Z0-9]{3,4})+
The groups are as I had hoped (even better as "," is no longer part of the result). Odd its no working with the "header". Anyhow I don't have to resort to "manually" splitting the line, and the RegEx statements can still "validate" each section.

Related

Regex capture group with non-uniform space group

I'm trying to parse the output of the "display interface brief" Comware switch command to convert it to a CSV file using RegEx. This command is printed using the following format:
Interface Link Speed Duplex Type PVID Description
BAGG51 UP 4G(a) F(a) T 1
FGE1/0/42 DOWN auto A T 1 ### LIVRE ###
GE6/0/20 UP 100M(a) F(a) A 1 LIVRE (MGMT - [WAN8-P8]
It's seems quite challenging for me because doesn't matter which RegEx I try, it doesn't properly handle "DOWN auto" and "100M(a) F(a)" output that has only one space between them. I also couldn't find a way to properly handle the last field, that can contain one or more spaces, but into most RegEx that I tried it create a separate capture group for each space instead of handling it's text content properly.
I'd also tried countless ways to try to parse it, and I couldn't find much content about parsing non-uniform columns into the Internet and StackOverflow community.
I need to parse it into the following format, with 7 capture groups per line, respecting the end of line:
BAGG51;UP;4G(a);F(a);T;1
FGE1/0/42;DOWN;auto;A;T;1;### LIVRE ###
GE6/0/20;UP;100M(a);F(a);A;1;LIVRE (MGMT - [WAN8-P8]
The most successfully RegEx that I found so far was: ^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+) replacing it to $1;$2;$3;$4;$5;$6;$7 using Notepad++ but it doesn't properly handle the "Description" field, that can be empty.
The following pattern seems to be working here:
^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)(?:[ ]+(.*))?
This follows your pattern with six mandatory capture groups, followed by an optional seventh capture group. The (?:[ ]+(\S+))? at the end of the pattern matches one or more spaces followed by the content. Note that this pattern should be used in multiline mode.
Here is a working demo

Multiple replace regex in one Apache-NiFi statement

I have a csv in following format.
id,mobile
1,02146477474
2,08585377474
3,07646474637
4,02158789566
5,04578599525
I want to add a new column and add just leading 3 numbers to that column (for specific cases and all the others NOT_VALID string). So result should be:
id,number,provider
1,02146477474,021
2,08585377474,085
3,07646474637,NOT_VALID
4,02158789566,021
5,04578599525,NOT_VALID
I can use following regex for replacing that. But I would like to use all possible conversations in one step. Using UpdateRecord processor.
${field.value:replaceFirst('085[0-9]+','085')}
When I use something like this:
${field.value:replaceFirst('085[0-9]+','085'):or(${field.value:replaceFirst('086[0-9]+','086')}`)}
This replaces all with false.
Nifi uses Java regex
As soon, as you are using record processing, this should work for you:
${field.value:replaceFirst('^(021|085)?.*','$1')}
The group () optionally ? catches 021 or 085 at the beginning of string ^
The replacement - $1 - is the first group
PS: The sites like https://regex101.com/ helps to understand regex

Regex: matching unknown groups that repeat?

I'm trying to create a generic regex pattern for a crawler, to avoid so called "crawler traps" (links that just add url parameters and refer to the exact same page, which results in tons of useless data). Alot of times, those links just add the same part to a URL over and over again. Here is an example out of a log file:
http://examplepage.com/cssms/chrome/cssms/chrome/cssms/pages/browse/cssms/pages/misc/...
I can use regular expressions to narrow the scope of the crawler and i would love to have a pattern, that tells the crawler to ignore everything that has repeating parts. Is that possible with a regex?
Thanks in advance for some tips!
JUST TO CLARIFY:
the crawlertraps are not designed to prevent crawling, they are a result of poor web design. All the pages we are crawling explicitly allowed us to do so!
If you are already looping through a list of URLs, you could add matching as a condition to skip the current iteration:
array = ["/abcd/abcd/abcd/abcd/", "http://examplepage.com/cssms/chrome/cssms/chrome/cssms/pages/browse/cssms/pages/misc/", "http://examplepage/apple/cake/banana/"]
import re
pattern1 = re.compile(r'.*?([^\/\&?]{4,})(?:[\/\&\?])(.*?\1){3,}.*')
for url in array:
if re.match(pattern1, url):
print "It matches; skipping this URL"
continue
print url
Example regex:
.*?([^\/\&?]{4,})(?:[\/\&\?])(.*?\1){3,}.*
([^\/\&?]{4,}) matches and captures sequences of anything, but not containing [/&?], repeated 4 or more times.
(?:[\/\&\?]) looks for one /,& or ?
(.*?(?:[\/\&\?])\1){3,} match anything until [/&?], followed by what we captured, doing all of this 3 or more times.
demo
You can use a backreference in Python/PERL regexes (and possibly others) to catch a pattern which is repeated:
>>> re.search(r"(/.+)\1", "http://examplepage.com/cssms/chrome/cssms/chrome/cssms/pages/browse/cssms/pages/misc/").group(1)
'/cssms/chrome'
\1 references the first match, so (/.+)\1 means the same sequence repeated twice in a row. The leading / is just to avoid the regex matching the first single repeating letter (which is the t in http) and catch repetitions in the path.

Using regex multiple capture groups to split up a string

I have a file that looks like this...
"1234567123456","V","0","0","BLAH","BLAH","BLAH","BLAH"
"1234567123456","D","TEST1 "
"1234567123456","D","TEST 2~TEST3"
"1234567123456","R","TEST4~TEST5"
"1234567123457","V","0","0","BLAH","BLAH","BLAH","BLAH"
"1234567123457","D","TEST 6"
"1234567123457","D","TEST7"
"1234567123457","R","TEST 8~TEST9~TEST,10"
All I'm trying to do is parse the D and R lines. The ~ is used in this case as a separator. So the end results would be...
"1234567123456","V","0","0","BLAH","BLAH","BLAH","BLAH"
"1234567123456","D","TEST1 "
"1234567123456","D","TEST3"
"1234567123456","D","TEST3"
"1234567123456","R","TEST4"
"1234567123456","R","TEST5"
"1234567123457","V","0","0","BLAH","BLAH","BLAH","BLAH"
"1234567123457","D","TEST 6"
"1234567123457","D","TEST7"
"1234567123457","R","TEST 8"
"1234567123457","R","TEST9"
"1234567123457","R","TEST,10"
I'm using regex on applications like Textpad and Notepad++. I have not figured out how to use a regex like /.+/g because the applications do not like the forward slashes. So I don't think I can use things like the global modifier. I currently have the following regex...
//In a program like Textpad/Notepad++
<FIND> "(.{13})","D","([^~]*)~(.*)
<REPLACE> "\1","D","\2"\n"\1","D","\3
Now if I run a find and replace with the above params a few times it would work fine (for the D lines only). The problem is there is an unknown number of lines to be made. For example...
"1234567123456","D","TEST1~TEST2~TEST3~TEST4~TEST5"
"1234567123457","D","TEST1~TEST2~TEST3"
"1234567123458","D","TEST1~TEST2"
"1234567123459","D","TEST1~TEST2~TEST3~TEST4"
I was hoping to be able to use a MULTI capture group to make this work. I found this PAGE talking about the common mistake between repeating a capturing group and capturing a repeated group. I need to capture a repeated group. For some reason I just could not make mine work right though. Anyone else have an idea?
Note: If I could get rid of the leading and trailing spaces EX: "1234567123456","D","TEST1 " ending up as "1234567123456","D","TEST1" that would be even better but not necessary.
RESOURCES:
http://www.regular-expressions.info/captureall.html
http://regex101.com/

How to Regex Multiple URLs From Same Variable In Perl

I'm trying to search a field in a database to extract URLs. Sometimes there will be more than 1 URL in a field and I would like to extract those in to separate variables (or an array).
I know my regex isn't going to cover all possibilities. As long as I flag on anything that starts with http and ends with a space I'm ok.
The problem I'm having is that my efforts either seem to get only 1 URL per record or they get only 1 the last letter from each URL. I've tried a couple different techniques based on solutions other have posted but I haven't found a solution that works for me.
Sample input line:
Testing http://marko.co http://tester.net Just about anything else you'd like.
Output goal
$var[0] = http://marko.co
$var[1] = http://tester.net
First try:
if ( $status =~ m/http:(\S)+/g ) {
print "$&\n";
}
Output:
http://marko.co
Second try:
#statusurls = ($status =~ m/http:(\S)+/g);
print "#statusurls\n";
Output:
o t
I'm new to regex, but since I'm using the same regex for each attempt, I don't understand why it's returning such different results.
Thanks for any help you can offer.
I've looked at these posts and either didn't find what I was looking for or didn't understand how to implement it:
This one seemed the most promising (and it's where I got the 2nd attempt from, but it didn't return the whole URL, just the letter: How can I store regex captures in an array in Perl?
This has some great stuff in it. I'm curious if I need to look at the URL as a word since it's bookended by spaces: Regex Group in Perl: how to capture elements into array from regex group that matches unknown number of/multiple/variable occurrences from a string?
This one offers similar suggestions as the first two. How can I store captures from a Perl regular expression into separate variables?
Solution:
#statusurls = ($status =~ m/(http:\S+)/g);
print "#statusurls\n";
Thanks!
I think that you need to capture more than just one character. Try this regex instead:
m/http:(\S+)/g