regex using positive lookahead - regex

My source data text looks something like this:
a1,a2,a3
a4,a5,a6
a7,a8,a9
test="1"
b1,b2,b3
b4,b5,b6
b7,b8,b9
test="2"
c1,c2,c3
c4,c5,c6
c7,c8,c9
test="3"
I need to parse this so the end result looks like this (appropriate “test” field included in each row):
a1,a2,a3,1
a4,a5,a6,1
a7,a8,a9,1
b1,b2,b3,2
b4,b5,b6,2
b7,b8,b9,2
c1,c2,c3,3
c4,c5,c6,3
c7,c8,c9,3
...etc
this what I started with and captures the fields correctly:
(?<f1>.*?),(?<f2>.*?),(?<f3>.*?)\s+
I understand I need to use lookarounds to capture and include the “test” field on each line.
So something like this added (using a positive lookahead)…
(?<f1>.*?),(?<f2>.*?),(?<f3>.*?)\s+(?=test="(?<test>.*?)")
This seems close but is not yielding all rows of data, but instead only the last row of data with the included test value as if it is consuming the look ahead row.
This expression with its captured groups are input into a .NET application that inserts these captured groups as fields within a database table. Number of fields is always static (4 in the example above; field1=f1, field2=f2, field3=f3, field4=test), but the number of records will be variable.
Any guidance would be appreciated.

Parsing your data to extract the relevant values
You are almost there, but need to allow the look ahead to skip the rows between the current one and the test line:
(?ms)(?<f1>\w+),(?<f2>\w+),(?<f3>\w+)\R(?=.*?^test="(?<test>\d+)")
\R matches all sort of newlines, (?ms) is the inline way of turning on the multiline and dot match all modifiers, so that .*?^test matches every line up to the test one, see demo here.
Again, your issue was that \s+ forced the lookahead to be on the line right after the one your were matching.

Related

How to use Postgres Regex Replace with a capture group

As the title presents above I am trying to reference a capture groups for a regex replace in a postgres query. I have read that the regex_replace does not support using regex capture groups. The regex I am using is
r"(?:[\s\(\)\=\)\,])(username)(?:[\s\(\)\=\)\,])?"gm
The above regex almost does what I need it to but I need to find out how to only allow a match if the capture groups also capture something. There is no situation where a "username" should be matched if it just so happens to be a substring of a word. By ensuring its surrounded by one of the above I can much more confidently ensure its a username.
An example application of the regex would be something like this in postgres (of course I would be doing an update vs a select):
select *, REGEXP_REPLACE(reqcontent,'(?:[\s\(\)\=\)\,])(username)(?:[\s\(\)\=\)\,])?' ,'NEW-VALUE', 'gm') from table where column like '%username%' limit 100;
If there is any more context that can be provided please let me know. I have also found similar posts (postgresql regexp_replace: how to replace captured group with evaluated expression (adding an integer value to capture group)) but that talks more about splicing in values back in and I don't think quite answers my question.
More context and example value(s) for regex work against. The below text may look familiar these are JQL filters in Jira. We are looking to update our usernames and all their occurrences in the table that contains the filter. Below is a few examples of filters. We originally were just doing a find a replace but that doesn't work because we have some usernames that are only two characters and it was matching on non usernames (e.g je (username) would place a new value in where the word project is found which completely malforms the JQL/String resulting in something like proNEW-VALUEct = balh blah)
type = bug AND status not in (Closed, Executed) AND assignee in (test, username)
assignee=username
assignee = username
Definition of Answered:
Regex that will only match on a 'username' if its surrounded by one of the specials
A way to regex/replace that username in a postgres query.
Capturing groups are used to keep the important bits of information matched with a regex.
Use either capturing groups around the string parts you want to stay in the result and use their placeholders in the replacement:
REGEXP_REPLACE(reqcontent,'([\s\(\)\=\)\,])username([\s\(\)\=\)\,])?' ,'\1NEW-VALUE\2', 'gm')
Or use lookarounds:
REGEXP_REPLACE(reqcontent,'(?<=[\s\(\)\=\)\,])(username)(?=[\s\(\)\=\)\,])?' ,'NEW-VALUE', 'gm')
Or, in this case, use word boundaries to ensure you only replace a word when inside special characters:
REGEXP_REPLACE(reqcontent,'\yusername\y' ,'NEW-VALUE', 'g')

Multiple replace regex in one Apache-NiFi statement

I have a csv in following format.
id,mobile
1,02146477474
2,08585377474
3,07646474637
4,02158789566
5,04578599525
I want to add a new column and add just leading 3 numbers to that column (for specific cases and all the others NOT_VALID string). So result should be:
id,number,provider
1,02146477474,021
2,08585377474,085
3,07646474637,NOT_VALID
4,02158789566,021
5,04578599525,NOT_VALID
I can use following regex for replacing that. But I would like to use all possible conversations in one step. Using UpdateRecord processor.
${field.value:replaceFirst('085[0-9]+','085')}
When I use something like this:
${field.value:replaceFirst('085[0-9]+','085'):or(${field.value:replaceFirst('086[0-9]+','086')}`)}
This replaces all with false.
Nifi uses Java regex
As soon, as you are using record processing, this should work for you:
${field.value:replaceFirst('^(021|085)?.*','$1')}
The group () optionally ? catches 021 or 085 at the beginning of string ^
The replacement - $1 - is the first group
PS: The sites like https://regex101.com/ helps to understand regex

Capture repeating group with RegEx

I am trying to parse an input line looking like this:
AC#10,N850FD,10%,WEEK,IFR,1/22:45,2/00:58,390,F,0743,KEWR,3/02:30,3/05:04,380,F,1202,KMEM,3/11:15,3/20:04,350,F,0038,LFPG,4/04:00,4/15:35,330,F,5342,ZGGG,4/19:05,4/22:50,370,F,5608,RJAA,5/13:25,5/14:45,300,F,0060,RJBB,5/18:05,6/06:35,330,F,0060,KMEM,6/20:45,0/05:42,340,F,0948,PHNL,0/07:21,0/12:24,370,F,0802,KLAX,0/14:49,0/18:09,370,F,0806,KMEM
The first 5 "fields" are the "header" ("AC#10,N850FD,10%,WEEK,IFR"), and the rest is are repeating groups of 6 "fields" (e.g. "1/22:45,2/00:58,390,F,0743,KEWR").
I'm a RegEx newbie, but to do this I have come up with the following RegEx statement: (AC#)(\d+),([a-zA-Z0-9]+),(\d+%),(WEEK|DAY),(IFR|VFR)(,\d\/\d{2}:\d{2},\d\/\d{2}:\d{2},\d+,[FR],\d+,[A-Z0-9]{3,5})+.
The result of the first many groups (each "field" in the "header") are extracted fine, and I can easily access each value (group). However my problem is the following/repeating groups. Only the last of the repeating "groups" are extracted. If I remove the very last "+" only the first of the repeating "groups" are extracted (naturally).
Example here: https://regex101.com/r/HsQMge/1
Here is the result I hope to get (as groups):
AC#
10
N850FD
10%
WEEK
IFR
,1/22:45,2/00:58,390,F,0743,KEWR
,3/02:30,3/05:04,380,F,1202,KMEM
,3/11:15,3/20:04,350,F,0038,LFPG
,4/04:00,4/15:35,330,F,5342,ZGGG
,4/19:05,4/22:50,370,F,5608,RJAA
,5/13:25,5/14:45,300,F,0060,RJBB
,5/18:05,6/06:35,330,F,0060,KMEM
,6/20:45,0/05:42,340,F,0948,PHNL
,0/07:21,0/12:24,370,F,0802,KLAX
,0/14:49,0/18:09,370,F,0806,KMEM
Probably RegEx is not the right tool to do this task. Maybe you can use it just for splitting string into array. Rest job is for array_chunk :
$str = "AC#10,N850FD,10%,WEEK,IFR,1/22:45,2/00:58,390,F,0743,KEWR,3/02:30,3/05:04,380,F,1202,KMEM,3/11:15,3/20:04,350,F,0038,LFPG,4/04:00,4/15:35,330,F,5342,ZGGG,4/19:05,4/22:50,370,F,5608,RJAA,5/13:25,5/14:45,300,F,0060,RJBB,5/18:05,6/06:35,330,F,0060,KMEM,6/20:45,0/05:42,340,F,0948,PHNL,0/07:21,0/12:24,370,F,0802,KLAX,0/14:49,0/18:09,370,F,0806,KMEM";
$data = preg_split('/[,#]/',$str);
$data = array_chunk($data, 6);
var_dump($data);
Try it online!
I can't get it to work with one regular expression (still think it should be possible), however I got it working in two passes. First I use the following RegEx, to split the individual fields of the "header" into groups, and then grab the rest of the input line as the last group (using "(.*)" after the last comma):
(AC#)(\d+),([a-zA-Z0-9]+),(\d+%),(WEEK|DAY),(IFR|VFR),(.*)
This leaves me with the rest of the information in one single group ("1/22:45,2/00:58,390,F,0743,KEWR,3/02:30,3/05:04,380,F,1202,KMEM,3/11:15,3/20:04,350,F,0038,LFPG,4/04:00,4/15:35,330,F,5342,ZGGG,4/19:05,4/22:50,370,F,5608,RJAA,5/13:25,5/14:45,300,F,0060,RJBB,5/18:05,6/06:35,330,F,0060,KMEM,6/20:45,0/05:42,340,F,0948,PHNL,0/07:21,0/12:24,370,F,0802,KLAX,0/14:49,0/18:09,370,F,0806,KMEM"). I then parse this group with another regular expression, that groups the repeating sections (without a problem - now there is no longer a "header"):
(\d\/\d{2}:\d{2},\d\/\d{2}:\d{2},\d+,[FR],\d+,[A-Z0-9]{3,4})+
The groups are as I had hoped (even better as "," is no longer part of the result). Odd its no working with the "header". Anyhow I don't have to resort to "manually" splitting the line, and the RegEx statements can still "validate" each section.

How to find "complicated" URLs in a text file

I'm using the following regex to find URLs in a text file:
/http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+/
It outputs the following:
http://rda.ucar.edu/datasets/ds117.0/.
http://rda.ucar.edu/datasets/ds111.1/.
http://www.discover-earth.org/index.html).
http://community.eosdis.nasa.gov/measures/).
Ideally they would print out this:
http://rda.ucar.edu/datasets/ds117.0/
http://rda.ucar.edu/datasets/ds111.1/
http://www.discover-earth.org/index.html
http://community.eosdis.nasa.gov/measures/
Any ideas on how I should tweak my regex?
Thank you in advance!
UPDATE - Example of the text would be:
this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/).
This will trim your output containing trail characters, ) .
import re
regx= re.compile(r'(?m)[\.\)]+$')
print(regx.sub('', your_output))
And this regex seems workable to extract URL from your original sample text.
https?:[\S]*\/(?:\w+(?:\.\w+)?)?
Demo,,, ( edited from https?:[\S]*\/)
Python script may be something like this
ss=""" this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/). """
regx= re.compile(r'https?:[\S]*\/(?:\w+(?:\.\w+)?)?')
for m in regx.findall(ss):
print(m)
So for the urls you have here:
https://regex101.com/r/uSlkcQ/4
Pattern explanation:
Protocols (e.g. https://)
^[A-Za-z]{3,9}:(?://)
Look for recurring .[-;:&=+\$,\w]+-class (www.sub.domain.com)
(?:[\-;:&=\+\$,\w]+\.?)+`
Look for recurring /[\-;:&=\+\$,\w\.]+ (/some.path/to/somewhere)
(?:\/[\-;:&=\+\$,\w\.]+)+
Now, for your special case: ensure that the last character is not a dot or a parenthesis, using negative lookahead
(?!\.|\)).
The full pattern is then
^[A-Za-z]{3,9}:(?://)(?:[\-;:&=\+\$,\w]+\.?)+(?:\/[\-;:&=\+\$,\w\.]+)+(?!\.|\)).
There are a few things to improve or change in your existing regex to allow this to work:
http[s]? can be changed to https?. They're identical. No use putting s in its own character class
[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),] You can shorten this entire thing and combine character classes instead of using | between them. This not only improves performance, but also allows you to combine certain ranges into existing character class tokens. Simplifying this, we get [a-zA-Z0-9$-_#.&+!*\(\),]
We can go one step further: a-zA-Z0-9_ is the same as \w. So we can replace those in the character class to get [\w$-#.&+!*\(\),]
In the original regex we have $-_. This creates a range so it actually inclues everything between $ and _ on the ASCII table. This will cause unwanted characters to be matched: $%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_. There are a few options to fix this:
[-\w$#.&+!*\(\),] Place - at the start of the character class
[\w$#.&+!*\(\),-] Place - at the end of the character class
[\w$\-#.&+!*\(\),] Escape - such that you have \- instead
You don't need to escape ( and ) in the character class: [\w$#.&+!*(),-]
[0-9a-fA-F][0-9a-fA-F] You don't need to specify [0-9a-fA-F] twice. Just use a quantifier like so: [0-9a-fA-F]{2}
(?:%[0-9a-fA-F][0-9a-fA-F]) The non-capture group isn't actually needed here, so we can drop it (it adds another step that the regex engine needs to perform, which is unnecessary)
So the result of just simplifying your existing regex is the following:
https?://(?:[$\w#.&+!*(),-]|%[0-9a-fA-F]{2})+
Now you'll notice it doesn't match / so we need to add that to the character class. Your regex was matching this originally because it has an improper range $-_.
https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+
Unfortunately, even with this change, it'll still match ). at the end. That's because your regex isn't told to stop matching after /. Even implementing this will now cause it to not match file names like index.html. So a better solution is needed. If you give me a couple of days, I'm working on a fully functional RFC-compliant regex that matches URLs. I figured, in the meantime, I would at least explain why your regex isn't working as you'd expect it to.
Thanks all for the responses. A coworker ended up helping me with it. Here is the solution:
des_links = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', des)
for i in des_links:
tmps = "/".join(i.split('/')[0:-1])
print(tmps)

Google Analytics - Content grouping - Regex fix

This is our URL structure:
http://www.disabledgo.com/access-guide/the-university-of-manchester/176-waterloo-place-2
http://www.disabledgo.com/access-guide/kingston-university/coombehurst-court-2
http://www.disabledgo.com/access-guide/kings-college-london/franklin-wilkins-building-2
http://www.disabledgo.com/access-guide/redbridge-college/brook-centre-learning-resource-centre
I am trying to create a list of groups based on the client names
/access-guide/[this bit]/...
So I can have a performance list of all our clients.
This is my regex:
/access-guide/(.*universit(y|ies)|.*colleg(e|es))/
I want it to group anything that has university/ies or college/es in it, at any point within that client name section of the URL.
At the moment, my current regex will only return groups that are X-University:
Durham-University
Plymouth-University
Cardiff-University
etc.
What does the regex need to be to have the list I'm looking for?
Do I need to have something at the end to stop it matching things after the client name? E.g. ([^/]+$)?
Thanks for your help in advance!
Depending upon your needs you may want to do:
/access-guide/([^/]*(?:university|universities|college|colleges)[^/]*)/
This will match names even if "university" or "college" is not at the end of the string. For example "college-of-the-ozarks" Note the non-capturing internal parenthesis, that should probably be used no matter what solution you go with, as you don't want to just match the word "university" or "college"
Live Example
Additionally, I don't know what may be in your but if you may have compound words you want to eliminate using a \b may be advisable. For instance if you don't want to match "miskatonic-postcollege" you may want to do something like this:
/access-guide/([^/]*\b(?:university|universities|college|colleges)\b[^/]*)/
If the client name section of the URL is after the access-guid/ and before the next /:
http://www.disabledgo.com/access-guide/the-university-of-manchester/176-waterloo-place-2
|----------------------------|
you need to use a negated character class to only match university before the regex reaches that rightmost / boundary.
As per the Reference:
You can extract pages by Page URL, Page Title, or Screen Name. Identify each one with a regex capture group (Analytics uses the first capture group for each expression)
Thus, you can use
/access-guide/([^/]*(universit(y|ies)|colleges?))
^^^^^
See demo.
The regex matches
/access-guide/ - leftmost boundary, matches /access-guide/ literally
[^/]* - any character other than / (so we still remain in that customer section)
(universit(y|ies)|colleges?) - university, or universities, orcollegeorcolleges` literally. Add more if needed.