regex group matching based on first entry

As part of a regex match, I am trying to select "development" or "product" depending on whether the first entry is "dd-develop" or "dd".
The regex below always matches "development", whether the first entry is "dd-develop" or just "dd".
I want to pick the second or third word based on the first value.
Any ideas?
Regex: (?(?=) (?:development) | (?:product))
Text: dd-develop development product.

From the looks of it, you're trying to decide whether to capture "development" or "product" based on the first word. This regex does that:
(?:dd-develop .*(development).*)|(?:dd .*(product).*)
If your string starts with dd-develop, it captures "development" (in group 1). If it starts with dd followed by a space, it captures "product" (in group 2). To reverse this, just switch the words in the capture groups.
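A minimal Python sketch of the alternation above, for anyone who wants to try it locally. The function name pick_word is hypothetical, and the non-capturing groups are written as (?:...); re.match is used so the first entry really has to be at the start of the string.

```python
import re

# Alternation from the answer, with the non-capturing groups written as (?:...)
PATTERN = re.compile(r"(?:dd-develop .*(development).*)|(?:dd .*(product).*)")

def pick_word(text):
    """Return the word selected by the first entry, or None if nothing matches."""
    m = PATTERN.match(text)  # match() anchors at the start of the string
    if m is None:
        return None
    # Group 1 is set by the dd-develop branch, group 2 by the dd branch
    return m.group(1) or m.group(2)

print(pick_word("dd-develop development product."))  # development
print(pick_word("dd development product."))          # product
```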

Related

How to use Postgres Regex Replace with a capture group

As the title says, I am trying to reference capture groups in a regex replace in a Postgres query. I have read that regexp_replace does not support using regex capture groups. The regex I am using is
r"(?:[\s\(\)\=\)\,])(username)(?:[\s\(\)\=\)\,])?"gm
The above regex almost does what I need, but I need to find out how to only allow a match if the surrounding groups also match something. A "username" should never be matched if it just happens to be a substring of a word. By ensuring it's surrounded by one of the characters above, I can be much more confident that it's a username.
An example application of the regex would be something like this in postgres (of course I would be doing an update vs a select):
select *, REGEXP_REPLACE(reqcontent,'(?:[\s\(\)\=\)\,])(username)(?:[\s\(\)\=\)\,])?' ,'NEW-VALUE', 'gm') from table where column like '%username%' limit 100;
If there is any more context that can be provided, please let me know. I have also found similar posts (postgresql regexp_replace: how to replace captured group with evaluated expression (adding an integer value to capture group)), but that one is more about splicing values back in and I don't think it quite answers my question.
More context and example values for the regex to work against: the text below may look familiar, as these are JQL filters in Jira. We are looking to update our usernames and all their occurrences in the table that contains the filters. Below are a few examples of filters. We originally were just doing a find and replace, but that doesn't work because some usernames are only two characters long and it was matching non-usernames (e.g. for the username je, it would place the new value inside the word project, which completely malforms the JQL string, resulting in something like proNEW-VALUEct = blah blah).
type = bug AND status not in (Closed, Executed) AND assignee in (test, username)
assignee=username
assignee = username
Definition of answered:
A regex that will only match a 'username' if it's surrounded by one of the special characters
A way to regex-replace that username in a Postgres query.

Capturing groups are used to keep the important bits of information matched with a regex.
Either put capturing groups around the parts you want to keep in the result and reference them with placeholders in the replacement:
REGEXP_REPLACE(reqcontent,'([\s\(\)\=\)\,])username([\s\(\)\=\)\,])?' ,'\1NEW-VALUE\2', 'gm')
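Or use lookarounds (in PostgreSQL a constraint cannot be followed by a quantifier, so the lookahead accepts the end of the string through an alternation instead):
REGEXP_REPLACE(reqcontent,'(?<=[\s\(\)\=\)\,])(username)(?=[\s\(\)\=\)\,]|$)' ,'NEW-VALUE', 'gm')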
Or, in this case, use word boundaries (\y in PostgreSQL) so that username is only replaced when it stands as a whole word:
REGEXP_REPLACE(reqcontent,'\yusername\y' ,'NEW-VALUE', 'g')
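For experimenting outside the database, here is a rough Python sketch of the lookaround and word-boundary variants. It is only an analogue, not PostgreSQL: Python's re module uses \b where PostgreSQL uses \y, and the function names and the sample JQL string are assumptions made for illustration.

```python
import re

DELIMS = r"[\s()=,]"  # same characters as the class used in the question

def replace_lookaround(text, username, new_value):
    # The username must follow a delimiter and precede a delimiter or end of string
    pattern = rf"(?<={DELIMS}){re.escape(username)}(?={DELIMS}|$)"
    return re.sub(pattern, lambda _m: new_value, text)

def replace_word_boundary(text, username, new_value):
    # \b is roughly Python's counterpart of PostgreSQL's \y
    pattern = rf"\b{re.escape(username)}\b"
    return re.sub(pattern, lambda _m: new_value, text)

jql = "project = demo AND assignee in (test, je)"
print(replace_lookaround(jql, "je", "NEW-VALUE"))
print(replace_word_boundary("assignee=je", "je", "NEW-VALUE"))
```

Both variants leave the "je" inside "project" untouched, which is the failure the plain find-and-replace ran into.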

Capture repeating group with RegEx

I am trying to parse an input line looking like this:
AC#10,N850FD,10%,WEEK,IFR,1/22:45,2/00:58,390,F,0743,KEWR,3/02:30,3/05:04,380,F,1202,KMEM,3/11:15,3/20:04,350,F,0038,LFPG,4/04:00,4/15:35,330,F,5342,ZGGG,4/19:05,4/22:50,370,F,5608,RJAA,5/13:25,5/14:45,300,F,0060,RJBB,5/18:05,6/06:35,330,F,0060,KMEM,6/20:45,0/05:42,340,F,0948,PHNL,0/07:21,0/12:24,370,F,0802,KLAX,0/14:49,0/18:09,370,F,0806,KMEM
The first 5 "fields" are the "header" ("AC#10,N850FD,10%,WEEK,IFR"), and the rest are repeating groups of 6 "fields" (e.g. "1/22:45,2/00:58,390,F,0743,KEWR").
I'm a RegEx newbie, but to do this I have come up with the following RegEx statement: (AC#)(\d+),([a-zA-Z0-9]+),(\d+%),(WEEK|DAY),(IFR|VFR)(,\d\/\d{2}:\d{2},\d\/\d{2}:\d{2},\d+,[FR],\d+,[A-Z0-9]{3,5})+.
The first groups (each "field" in the "header") are extracted fine, and I can easily access each value (group). My problem is the repeating groups that follow: only the last of them is extracted. If I remove the very last "+", only the first of the repeating groups is extracted (naturally).
Example here: https://regex101.com/r/HsQMge/1
Here is the result I hope to get (as groups):
AC#
10
N850FD
10%
WEEK
IFR
,1/22:45,2/00:58,390,F,0743,KEWR
,3/02:30,3/05:04,380,F,1202,KMEM
,3/11:15,3/20:04,350,F,0038,LFPG
,4/04:00,4/15:35,330,F,5342,ZGGG
,4/19:05,4/22:50,370,F,5608,RJAA
,5/13:25,5/14:45,300,F,0060,RJBB
,5/18:05,6/06:35,330,F,0060,KMEM
,6/20:45,0/05:42,340,F,0948,PHNL
,0/07:21,0/12:24,370,F,0802,KLAX
,0/14:49,0/18:09,370,F,0806,KMEM

RegEx is probably not the right tool for this whole task. You could use it just to split the string into an array and leave the rest to array_chunk:
$str = "AC#10,N850FD,10%,WEEK,IFR,1/22:45,2/00:58,390,F,0743,KEWR,3/02:30,3/05:04,380,F,1202,KMEM,3/11:15,3/20:04,350,F,0038,LFPG,4/04:00,4/15:35,330,F,5342,ZGGG,4/19:05,4/22:50,370,F,5608,RJAA,5/13:25,5/14:45,300,F,0060,RJBB,5/18:05,6/06:35,330,F,0060,KMEM,6/20:45,0/05:42,340,F,0948,PHNL,0/07:21,0/12:24,370,F,0802,KLAX,0/14:49,0/18:09,370,F,0806,KMEM";
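// Split on commas and on "#", so "AC#10" becomes two fields and the header has 6 fields
$data = preg_split('/[,#]/',$str);
// Group the flat field list into rows of 6: the header first, then one row per repeating group
$data = array_chunk($data, 6);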
var_dump($data);

I can't get it to work with one regular expression (I still think it should be possible), but I got it working in two passes. First I use the following RegEx to split the individual fields of the "header" into groups and grab the rest of the input line as the last group (using "(.*)" after the last comma):
(AC#)(\d+),([a-zA-Z0-9]+),(\d+%),(WEEK|DAY),(IFR|VFR),(.*)
This leaves me with the rest of the information in one single group ("1/22:45,2/00:58,390,F,0743,KEWR,3/02:30,3/05:04,380,F,1202,KMEM,3/11:15,3/20:04,350,F,0038,LFPG,4/04:00,4/15:35,330,F,5342,ZGGG,4/19:05,4/22:50,370,F,5608,RJAA,5/13:25,5/14:45,300,F,0060,RJBB,5/18:05,6/06:35,330,F,0060,KMEM,6/20:45,0/05:42,340,F,0948,PHNL,0/07:21,0/12:24,370,F,0802,KLAX,0/14:49,0/18:09,370,F,0806,KMEM"). I then parse this group with another regular expression, that groups the repeating sections (without a problem - now there is no longer a "header"):
(\d\/\d{2}:\d{2},\d\/\d{2}:\d{2},\d+,[FR],\d+,[A-Z0-9]{3,4})+
The groups are as I had hoped (even better, as "," is no longer part of the result). Odd that it's not working with the "header". Anyhow, I don't have to resort to "manually" splitting the line, and the RegEx statements can still "validate" each section.
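The two-pass idea can be sketched in Python as follows. In most regex engines, including PCRE and Python's re, a capturing group inside a repetition keeps only its last iteration, which is why the single-expression attempt returns just the final section. The names HEADER, SECTION and parse_line are hypothetical, and the sample line in the usage is shortened to two repeating sections.

```python
import re

# Pass 1: header fields plus "everything else" as the last group
HEADER = re.compile(r"(AC#)(\d+),([a-zA-Z0-9]+),(\d+%),(WEEK|DAY),(IFR|VFR),(.*)")
# Pass 2: one repeating section; findall() returns every occurrence
SECTION = re.compile(r"\d/\d{2}:\d{2},\d/\d{2}:\d{2},\d+,[FR],\d+,[A-Z0-9]{3,4}")

def parse_line(line):
    """Return (header_fields, sections) or raise ValueError on a malformed header."""
    m = HEADER.fullmatch(line)
    if m is None:
        raise ValueError("line does not start with a valid header")
    *header, rest = m.groups()
    return header, SECTION.findall(rest)

sample = "AC#10,N850FD,10%,WEEK,IFR,1/22:45,2/00:58,390,F,0743,KEWR,3/02:30,3/05:04,380,F,1202,KMEM"
header, sections = parse_line(sample)
print(header)
print(sections)
```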

regex selecting multiple field

From the following example pattern, I want to select the first 3 entries in the line.
Say:
timestamp
hostname
the first word after the hostname
Example pattern:
2017-04-24T09:20:01.687387+00:00 aabvabcw74.def.co.uk hostd-probe: lacp: DEBUG]:147, Recv signal 15, LACP service is about to stop
2017-04-24T09:20:01.687387+00:00 aacdefabcw74.def.co.uk hostd-probe: lacp: DEBUG]:147, Recv signal 15, LACP service is about to stop
I used the following regexes, and they worked fine.
REGEX 1 - ^(?:[^\s]*\s){1}([^\s]*) - to select the timestamp and hostname.
REGEX 2 - ^(?:[^\s]*\s){2}([^\s]\w+) - to select the word after the hostname.
2017-04-24T09:20:01.687387+00:00 hostd probing is done Fdm: sslThumbprint>95:43:64:71:A3:60:D8:17:C8:6F:68:83:92:CE:E4:3B:53:4E:1D:AD10.199.6.5a2:0e:09:01:0a:00a2:0e:09:01:0b:01/vmfs/volumes/b01f388c-aaa4889f/vmfs/volumes/6ad2d8d7-86746df14435.5.03568722host-619286aabvabcs16.def.co.uk
But the log line above caused a problem: since it is not in standard syslog format, "hostd" was picked as the hostname.
I would like a regex that only selects logs which have a timestamp as the first entry and a hostname as the second entry (it always ends with .def.co.uk), and, if both conditions are satisfied, selects the 3rd entry.
How can I achieve this?

^(\S+)\s(\w+\.def\.co\.uk)\s(.+?)\s
Breakdown:
(\S+)\s captures the date and timestamp, and leaves out the space after it
(\w+\.def\.co\.uk)\s captures the hostname only if it is something.def.co.uk, and leaves the space out again
(.+?)\s non-greedily captures the first word (assuming "word" means no space in between)
EDIT:
If you also want the date and time in their own capture groups, it should look like this:
^(\S+)(T\S+)\s(\w+\.def\.co\.uk)\s(.+?)\s
Hope this helps!
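A small Python sketch of the three-group pattern, with the dots in the hostname suffix escaped. The function name first_three_fields is hypothetical, and the second sample line is a shortened version of the non-standard log from the question.

```python
import re

# timestamp, hostname ending in .def.co.uk, first word after the hostname
LINE = re.compile(r"^(\S+)\s(\w+\.def\.co\.uk)\s(.+?)\s")

def first_three_fields(line):
    """Return (timestamp, hostname, word) or None if the line is not in the expected format."""
    m = LINE.match(line)
    return m.groups() if m else None

good = "2017-04-24T09:20:01.687387+00:00 aabvabcw74.def.co.uk hostd-probe: lacp: DEBUG]:147, Recv signal 15"
bad = "2017-04-24T09:20:01.687387+00:00 hostd probing is done Fdm: sslThumbprint"
print(first_three_fields(good))
print(first_three_fields(bad))
```

The second line is rejected because the second entry does not end with .def.co.uk, so "hostd" is never mistaken for a hostname.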

Google Analytics - Content grouping - Regex fix

This is our URL structure:
http://www.disabledgo.com/access-guide/the-university-of-manchester/176-waterloo-place-2
http://www.disabledgo.com/access-guide/kingston-university/coombehurst-court-2
http://www.disabledgo.com/access-guide/kings-college-london/franklin-wilkins-building-2
http://www.disabledgo.com/access-guide/redbridge-college/brook-centre-learning-resource-centre
I am trying to create a list of groups based on the client names:
/access-guide/[this bit]/...
So I can have a performance list of all our clients.
This is my regex:
/access-guide/(.*universit(y|ies)|.*colleg(e|es))/
I want it to group anything that has university/ies or college/es in it, at any point within that client name section of the URL.
At the moment, my current regex will only return groups that are X-University:
Durham-University
Plymouth-University
Cardiff-University
etc.
What does the regex need to be to have the list I'm looking for?
Do I need to have something at the end to stop it matching things after the client name? E.g. ([^/]+$)?
Thanks for your help in advance!

Depending upon your needs you may want to do:
/access-guide/([^/]*(?:university|universities|college|colleges)[^/]*)/
This will match names even if "university" or "college" is not at the end of the string, for example "college-of-the-ozarks". Note the non-capturing inner parentheses; they should probably be used whichever solution you go with, as you don't want to capture just the word "university" or "college".
Additionally, I don't know what your client names may contain, but if there may be compound words you want to exclude, using \b may be advisable. For instance, if you don't want to match "miskatonic-postcollege", you may want to do something like this:
/access-guide/([^/]*\b(?:university|universities|college|colleges)\b[^/]*)/
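To compare the two patterns outside Google Analytics, a quick Python sketch could look like this. The names LOOSE, STRICT and client_name are hypothetical, and the miskatonic URL is a made-up test case; Python's re is used here only as a stand-in for the Analytics regex engine.

```python
import re

KEYWORDS = r"(?:university|universities|college|colleges)"
# Keyword anywhere inside the client-name segment
LOOSE = re.compile(rf"/access-guide/([^/]*{KEYWORDS}[^/]*)/")
# Same, but the keyword must stand as a whole word
STRICT = re.compile(rf"/access-guide/([^/]*\b{KEYWORDS}\b[^/]*)/")

def client_name(url, pattern=LOOSE):
    """Return the client-name segment if it contains a keyword, else None."""
    m = pattern.search(url)
    return m.group(1) if m else None

real = "http://www.disabledgo.com/access-guide/the-university-of-manchester/176-waterloo-place-2"
made_up = "http://www.disabledgo.com/access-guide/miskatonic-postcollege/library"
print(client_name(real))
print(client_name(made_up), client_name(made_up, STRICT))
```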

If the client name section of the URL is after access-guide/ and before the next /:
http://www.disabledgo.com/access-guide/the-university-of-manchester/176-waterloo-place-2
                                      |----------------------------|
you need to use a negated character class so that university is only matched before the regex reaches that rightmost / boundary.
As per the Reference:
You can extract pages by Page URL, Page Title, or Screen Name. Identify each one with a regex capture group (Analytics uses the first capture group for each expression)
Thus, you can use
/access-guide/([^/]*(universit(y|ies)|colleges?))
               ^^^^^
The regex matches
/access-guide/ - leftmost boundary, matches /access-guide/ literally
[^/]* - any character other than / (so we still remain in that customer section)
(universit(y|ies)|colleges?) - university, or universities, or college, or colleges literally. Add more if needed.

Regex pattern for containing string as well as not ending with pattern

I have been asked to make 2 regex to determine by the URL if a page is a product page or a category page.
These are the URLs:
Product page: www.domain.com/art/something/someotherthing/article(X123456.123)/
Category page: www.domain.com/art/something/someotherthing
I created this regex which works fine for the product page:
^.*\/art.*\/[xX]?[0-9]{6,7}\.[0-9]+\/$
Now I have problems with the category page. The only possibility I see is to make sure it does not end with the pattern that checks the ending numbers, "[xX]?[0-9]{6,7}\.[0-9]+". But I also need to make sure that it starts with /art/ after the domain.
My first try was this for the category page:
.*\/art.*\/(?!([xX]?[0-9]{6,7}\.[0-9]+\(\/)?))$
This doesn't work: the negative lookahead succeeds because it does not find the pattern after the second "any characters" match (.*).

It looks like the distinguishing factor is the number of slashes, apart from an optional trailing slash, which is often ignored. Each segment uses + so that the empty string after a trailing slash is not counted as a segment.
^[^\/]*(\/[^\/]+){3}\/?$ would match the category, and
^[^\/]*(\/[^\/]+){4}\/?$ would match the product.
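A minimal Python sketch of the slash-counting idea. One labeled choice: each path segment is written as [^/]+ rather than [^/]*, since with * a category URL that ends in a slash would also satisfy the four-segment product pattern. The function name page_type is hypothetical.

```python
import re

# Domain part, then exactly 3 (category) or 4 (product) non-empty path segments
CATEGORY = re.compile(r"^[^/]*(/[^/]+){3}/?$")
PRODUCT = re.compile(r"^[^/]*(/[^/]+){4}/?$")

def page_type(url):
    """Classify a URL by its number of path segments."""
    if PRODUCT.match(url):
        return "product"
    if CATEGORY.match(url):
        return "category"
    return None

print(page_type("www.domain.com/art/something/someotherthing/article(X123456.123)/"))
print(page_type("www.domain.com/art/something/someotherthing"))
```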

I think you don't have to use any lookarounds here.
Since the domain is fixed, /art/ is fixed, and the last part of the product URL (article + something) is fixed, you can use them explicitly in the regex, which makes it faster.
For product:
^www\.domain\.com\/art\/[^\/]+\/[^\/]+\/article\([^\/]+\)\/$
For category:
^www\.domain\.com\/art\/[^\/]+\/[^\/]+\/?$

From the question description and the URL data given...
Product URLs
matched by ^([^\/\r\n]+?)\/(art)\/(.*)\/.*?\(([xX]?[0-9]{6,7}\.[0-9]+)\).*?\/?$
1st capture == domain
2nd capture == art (main category?)
3rd capture == category
4th capture == Product ID
Category URLs
matched by ^([^\/\r\n]+?)\/(art)\/((?!.*[xX]?[0-9]{6,7}\.[0-9]+).*?)\/?$
1st capture == domain
2nd capture == art (main category?)
3rd capture == category
I did infer that the trailing / was optional for both URLs, but that may be an incorrect assumption.
The regexes above link to live regex101 fiddles with the given regex plus test data.
Do note that the \r\n inclusion within the character class for the domain match is only needed because the regex101 fiddle match is done globally on combined test data. You can remove that character sequence if you are only matching against a single URL at a time.
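For matching one URL at a time, the two capture-based patterns can be tried in Python like this. The patterns are copied from the answer as written; the function name classify and the returned tuple shape are assumptions made for illustration.

```python
import re

PRODUCT = re.compile(r"^([^\/\r\n]+?)\/(art)\/(.*)\/.*?\(([xX]?[0-9]{6,7}\.[0-9]+)\).*?\/?$")
CATEGORY = re.compile(r"^([^\/\r\n]+?)\/(art)\/((?!.*[xX]?[0-9]{6,7}\.[0-9]+).*?)\/?$")

def classify(url):
    """Return (kind, captured groups) for a product or category URL, else None."""
    m = PRODUCT.match(url)
    if m:
        return "product", m.groups()
    m = CATEGORY.match(url)  # the negative lookahead rejects URLs containing a product ID
    if m:
        return "category", m.groups()
    return None

print(classify("www.domain.com/art/something/someotherthing/article(X123456.123)/"))
print(classify("www.domain.com/art/something/someotherthing"))
```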
