Boolean regex AND with multiple capture groups in JavaScript not working

I'm attempting to capture three mandatory elements of a URL with multiple capture groups in a JavaScript regex, but for the life of me I can't get it to work. Does anyone know what I'm doing wrong?
https://bobsfurniture.com/chairs/COZRdyga141uWgV5w/purchase/?itemIds=qUUWmD7eRaCz9wnJEGLZQQ
I'm trying to capture the domain, the product category (chairs), and the purchase segment. This is my pattern:
(?=.*bobsfurniture)(?=.*chairs)(?=.*purchase)
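One possible reading of the problem: the three lookaheads only assert that each word is present. They contain no capturing groups, so a successful match captures nothing. A minimal sketch, assuming the goal is to require all three parts and also get them back as captures, puts a capturing group inside each lookahead (the variable names are illustrative):

```javascript
const url =
  "https://bobsfurniture.com/chairs/COZRdyga141uWgV5w/purchase/?itemIds=qUUWmD7eRaCz9wnJEGLZQQ";

// Every lookahead must succeed (boolean AND); the group inside each one
// keeps the text that satisfied it.
const allThree = /^(?=.*(bobsfurniture))(?=.*(chairs))(?=.*(purchase))/;

const found = url.match(allThree);
// `found` is null as soon as one of the three parts is missing.
const parts = found ? found.slice(1) : null;
console.log(parts);
```

Because the lookaheads are independent, the three words may appear in any order in the URL.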

Related

How to use Postgres Regex Replace with a capture group

As the title says, I am trying to reference capture groups in a regex replace in a Postgres query. I have read that regexp_replace does not support using regex capture groups. The regex I am using is
r"(?:[\s\(\)\=\)\,])(username)(?:[\s\(\)\=\)\,])?"gm
The above regex almost does what I need, but I need to find out how to only allow a match if the groups around the username also match something. There is no situation where "username" should be matched if it just happens to be a substring of a word. By ensuring it's surrounded by one of the characters above, I can be much more confident that it's a username.
An example application of the regex would be something like this in postgres (of course I would be doing an update vs a select):
select *, REGEXP_REPLACE(reqcontent,'(?:[\s\(\)\=\)\,])(username)(?:[\s\(\)\=\)\,])?' ,'NEW-VALUE', 'gm') from table where column like '%username%' limit 100;
If there is any more context that can be provided, please let me know. I have also found similar posts (postgresql regexp_replace: how to replace captured group with evaluated expression (adding an integer value to capture group)), but that one is more about splicing values back in and I don't think it quite answers my question.
More context and example values for the regex to work against: the text below may look familiar, as these are JQL filters in Jira. We are looking to update our usernames and all their occurrences in the table that contains the filters. Below are a few examples of filters. We originally just did a find and replace, but that doesn't work because some of our usernames are only two characters long and it matched non-usernames (e.g. for the username je, it would place the new value inside the word project, which completely malforms the JQL string, resulting in something like proNEW-VALUEct = blah blah).
type = bug AND status not in (Closed, Executed) AND assignee in (test, username)
assignee=username
assignee = username
Definition of Answered:
Regex that will only match on a 'username' if its surrounded by one of the specials
A way to regex/replace that username in a postgres query.
Capturing groups are used to keep the important bits of information matched with a regex.
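Either put capturing groups around the parts of the string you want to keep in the result and use their placeholders in the replacement (the trailing group matches a delimiter or the end of the string, so that username is not matched as the prefix of a longer word):
REGEXP_REPLACE(reqcontent,'([\s\(\)\=\)\,])username([\s\(\)\=\)\,]|$)' ,'\1NEW-VALUE\2', 'gm')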
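Or use lookarounds (PostgreSQL does not allow a quantifier after a lookaround, so the optional trailing check is written as an alternation with the end of the string):
REGEXP_REPLACE(reqcontent,'(?<=[\s\(\)\=\)\,])(username)(?=[\s\(\)\=\)\,]|$)' ,'NEW-VALUE', 'gm')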
Or, in this case, use word boundaries (\y in PostgreSQL) so that username is replaced only when it appears as a whole word:
REGEXP_REPLACE(reqcontent,'\yusername\y' ,'NEW-VALUE', 'g')
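For comparison, the same two ideas can be tried outside the database. The sketch below is JavaScript, not PostgreSQL: JavaScript hands the captured delimiters to a replacement callback (or uses $1 and $2 instead of \1 and \2), and its word boundary is \b instead of \y. The function names and the sample username je are illustrative assumptions.

```javascript
// Escape regex metacharacters so the username is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Variant 1: capture the surrounding delimiters and put them back.
function renameWithGroups(jql, oldName, newName) {
  const delimiter = "[\\s()=,]";
  const pattern = new RegExp(
    `(${delimiter})${escapeRegExp(oldName)}(${delimiter}|$)`,
    "g"
  );
  return jql.replace(pattern, (_match, before, after) => before + newName + after);
}

// Variant 2: rely on word boundaries, nothing needs to be put back.
function renameWithBoundaries(jql, oldName, newName) {
  const pattern = new RegExp(`\\b${escapeRegExp(oldName)}\\b`, "g");
  return jql.replace(pattern, () => newName);
}

const filter = "project = demo AND assignee in (test, je)";
console.log(renameWithGroups(filter, "je", "NEW-VALUE"));
console.log(renameWithBoundaries("assignee=je", "je", "NEW-VALUE"));
```

In both variants the je inside project is left alone, because it is neither preceded by a delimiter nor at a word boundary.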

CloudWatch Insights - Group logs by url with unique ids removed

I'm looking to use CloudWatch Logs Insights to group logs by a request url field, however the url can contain 0-2 unique numerical identifiers that I'd like to be ignored when doing the grouping.
Some examples of urls:
/dev/user
/dev/user/123
/dev/user/123/inventory/4
/dev/server/3/statistics
The groups would look something like:
/dev/user
/dev/user/
/dev/user//inventory/
/dev/server//statistics
I have something quite close to what I need: it extracts the section of the URL in front of the first optional identifier and the section between the first and second identifiers, and concatenates the two, but it isn't totally reliable. This is where I'm at currently; #message is valid JSON which contains an 'endpoint' field that looks like one of the URLs above:
fields #message | parse endpoint /(\bdev)\/(?<#prefix>[^0-9]+)(?:[0-9]+)(?<#suffix>[^0-9]+)/ | stats count(*) by #prefix
While this query will work with endpoints like '/dev/accounts/1' it ignores endpoints like '/dev/accounts' as it doesn't have all of the components the regex is looking for, which means I'm missing a lot of results.
If there are 0-2 numerical identifiers that you want to remove, you could match the first and optionally match the second number and use 2 capturing groups to capture what you want to keep.
In the replacement use the 2 capturing groups $1$2
^(.*?\/)\d+(?:(.*?\/)\d+\b)?
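To see what this pattern and the $1$2 replacement do to the sample URLs, here is a small sketch. It is plain JavaScript used only to exercise the regex, not CloudWatch Logs Insights syntax, and the function and variable names are illustrative.

```javascript
const endpoints = [
  "/dev/user",
  "/dev/user/123",
  "/dev/user/123/inventory/4",
  "/dev/server/3/statistics",
];

// Group 1: everything up to the first numeric id.
// Group 2 (optional): everything between the first and the second id.
const idPattern = /^(.*?\/)\d+(?:(.*?\/)\d+\b)?/;

// Drop the ids; an unmatched group 2 is substituted as an empty string.
function groupKey(endpoint) {
  return endpoint.replace(idPattern, "$1$2");
}

// Count requests per normalized URL, similar in spirit to "stats count(*) by".
const counts = new Map();
for (const endpoint of endpoints) {
  const key = groupKey(endpoint);
  counts.set(key, (counts.get(key) ?? 0) + 1);
}
console.log([...counts.keys()]);
```

A URL without any numeric identifier does not match the pattern at all, so replace returns it unchanged.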
Looks like I can use question marks outside of capture groups to mark those groups as optional, which has resolved the last issue I was having.

Regex PCRE capture multiple occurrences query string in URL

I am trying to capture multiple occurrences of utm tags in a URL and append them when rewriting the URL. However, I want only the utm key-value pairs and want to skip the others.
This is a sample URL
https://example.com/dl/?screen=page&title=SABC&page_id=4063&myvalue=Noidea&utm_source=sourceTest19&utm_medium=mediumTest19&utm_campaign=campaignTest19&utm_term=termTest19&test=value&utm_content=contentTest19
I tried this:
(\?.*)(page_id=([^&]*))(\?|&)(.*[&?]utm_[a-z]+=([^&]+).*)
Unfortunately, it doesn't produce the result I expect.
I need to capture both the page ID and the utm tags, but not test=value or myvalue=Noidea; I only want the query parameters with utm tags.
Expected Result is the URL below:
https://example.com/dl/page_id/4063?utm_source=sourceTest19&utm_medium=mediumTest19&utm_campaign=campaignTest19&utm_term=termTest19&utm_content=contentTest19
one group with pageid=<somenumber/text>
one group with all utm tags with key and value
Help will be appreciated.
You can use a regex like this to get the groups:
(?:(page_id|utm_[a-z]+)=[A-Za-z0-9]+)
You can instead replace any parameter that does not match the desired ones with the empty string. The pattern for this is
(?:[?&](?!(?:page_id|utm_[^=&]++)=)[^&]*+)++$|(?<=[?&])(?!(?:page_id|utm_[^=&]++)=)[^&]*+(?:&|$)
Here's a working proof: https://regex101.com/r/L5xcl4/2 (it has an extra \s only so that it works on the multiline input in the tester; you shouldn't need it, as you'll be working on a string that contains only a URL without whitespace).
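A different route to the expected URL, sketched here in JavaScript rather than PCRE, is to collect the wanted parameters instead of deleting the unwanted ones. This swaps the removal pattern above for a simple key=value matcher; the function name rewriteUrl and the way the new path is assembled are assumptions based on the expected result in the question.

```javascript
// Matches only page_id and utm_* parameters, one key=value pair per match.
const wantedParam = /[?&](page_id|utm_[a-z]+)=([^&]*)/g;

function rewriteUrl(url) {
  const queryStart = url.indexOf("?");
  if (queryStart === -1) return url;

  const base = url.slice(0, queryStart);
  const query = url.slice(queryStart);

  let pageId = null;
  const utmPairs = [];
  for (const [, key, value] of query.matchAll(wantedParam)) {
    if (key === "page_id") pageId = value;
    else utmPairs.push(`${key}=${value}`);
  }
  // Assumption: without a page_id there is nothing to rewrite.
  if (pageId === null) return url;

  const path = base.endsWith("/") ? base : base + "/";
  const utmQuery = utmPairs.length > 0 ? "?" + utmPairs.join("&") : "";
  return `${path}page_id/${pageId}${utmQuery}`;
}

const sample =
  "https://example.com/dl/?screen=page&title=SABC&page_id=4063&myvalue=Noidea" +
  "&utm_source=sourceTest19&utm_medium=mediumTest19&utm_campaign=campaignTest19" +
  "&utm_term=termTest19&test=value&utm_content=contentTest19";
console.log(rewriteUrl(sample));
```

The utm parameters keep the order in which they appear in the original query string.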

How to combine groups in Regex with non capturing groups to have all optional

What I'm trying to achieve: I want to match a user-entered sentence against my templates and see which template matches best (as many of the template's groups as possible).
The regex I'm building for this example:
^(\bMyCompany1\b)?(?:.+)?\s(\bestablishes\b)?(?:.+)?\s(\bAnotherCompany\b)?(?:.+)?$
Example sentences:
'MyCompany1 establishes AnotherCompany' - matches all 3 groups. This is OK.
'MyCompany1 establ AnotherCompany' - matches the first and last groups and ignores the typo in the middle. This is also OK.
'MyCompany1 establishes AnotherCompany ' - space at the end. Groups 2 and 3 are not identified, and I don't understand why.
'MyCompany1 establishes    AnotherCompany' - additional spaces after the word 'establishes'. For some reason the 2nd group is no longer detected.
This regex is just an example of one template. I will have one dynamically built regex per template, for templates like 'User1 sent a request to User2' or 'Company1 borrowed to Company2 $111'. My idea is to define each part of the template and see how many parts matched. E.g. in my example:
- I expect some company name from the list (MyCompany or MyCompany1), or a non-capturing group to ignore the rest (maybe the user made a typo or is still typing and hasn't finished)
- I expect the groups to appear in the same order
Can you please explain what I'm doing wrong in my regex? Is regex the right tool for this at all?
This covers all your test cases. It is based on three lookaheads, each containing an optional non-capturing group that wraps a capturing group for one of the keywords you're looking for.
^(?=(?:.*(\bMyCompany1\b))?)(?=(?:.*?(\bestablishes\b))?)(?=(?:.*(\bAnotherCompany\b))?).*$
Or, if the order matters:
^(?:.*(\bMyCompany1\b))?(?:.*?(\bestablishes\b))?(?:.*(\bAnotherCompany\b))?.*$
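The "how many parts of the template matched" idea from the question can be sketched by counting the groups that actually captured something. The code below is JavaScript and uses the two patterns from this answer as written; the helper name countMatchedParts is an illustrative assumption.

```javascript
// Order-independent template: each keyword sits in its own optional lookahead.
const anyOrder =
  /^(?=(?:.*(\bMyCompany1\b))?)(?=(?:.*?(\bestablishes\b))?)(?=(?:.*(\bAnotherCompany\b))?).*$/;

// Order-sensitive template: the optional parts must appear left to right.
const inOrder =
  /^(?:.*(\bMyCompany1\b))?(?:.*?(\bestablishes\b))?(?:.*(\bAnotherCompany\b))?.*$/;

// Score = number of capture groups that participated in the match.
function countMatchedParts(template, sentence) {
  const match = sentence.match(template);
  if (!match) return 0;
  return match.slice(1).filter((part) => part !== undefined).length;
}

const sentences = [
  "MyCompany1 establishes AnotherCompany",
  "MyCompany1 establ AnotherCompany",
  "MyCompany1 establishes AnotherCompany ",
  "MyCompany1 establishes    AnotherCompany",
];
for (const sentence of sentences) {
  console.log(JSON.stringify(sentence), countMatchedParts(anyOrder, sentence));
}
```

With one such regex per template, the template with the highest score for a given sentence would be the best match.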
Could you please try the regex below:
^(\bMyCompany1\b)?\s+(\bestablishes\b)?\s+(\bAnotherCompany\b)?(?:.+)?$
Hope it helps.

Regex pattern for containing string as well as not ending with pattern

I have been asked to write two regexes that determine from the URL whether a page is a product page or a category page.
These are the URLs:
Product page: www.domain.com/art/something/someotherthing/article(X123456.123)/
Category page: www.domain.com/art/something/someotherthing
I created this regex which works fine for the product page:
^.*\/art.*\/[xX]?[0-9]{6,7}\.[0-9]+\/$
Now I have problems with the category page. The only option I see is to make sure the URL does not end with the pattern that checks the trailing numbers, "[xX]?[0-9]{6,7}\.[0-9]+". But I also need to make sure that the path starts with /art/ after the domain.
My first try was this for the category page:
.*\/art.*\/(?!([xX]?[0-9]{6,7}\.[0-9]+\(\/)?))$
This doesn't work: the negative lookahead always succeeds, because the pattern is not found at the position where the second "any characters" match (.*) ends.
It looks like a differentiating factor is the number of slashes, apart from an optional trailing slash, which is often ignored.
^[^\/]*(\/[^\/]*){3}\/?$ would match the category, and
^[^\/]*(\/[^\/]*){4}\/?$ would match the product.
I think you don't have to use any lookarounds here.
Since the domain is fixed, the /art/ segment is fixed, and the last part of the product URL (article followed by something in parentheses) is fixed, you can spell them out explicitly in the regex, which also makes it faster.
For product:
^www\.domain\.com\/art\/[^\/]+\/[^\/]+\/article\([^\/]+\)\/$
For category:
^www\.domain\.com\/art\/[^\/]+\/[^\/]+\/?$
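A quick way to check the two explicit patterns against the sample URLs is a small classifier. This is a JavaScript sketch; the function name classify is illustrative, and the trailing slash of the category pattern is treated as optional here on the assumption that the sample category URL, which has none, should match.

```javascript
const productPage =
  /^www\.domain\.com\/art\/[^\/]+\/[^\/]+\/article\([^\/]+\)\/$/;
// Assumption: the trailing slash is optional for category pages.
const categoryPage = /^www\.domain\.com\/art\/[^\/]+\/[^\/]+\/?$/;

function classify(url) {
  if (productPage.test(url)) return "product";
  if (categoryPage.test(url)) return "category";
  return "unknown";
}

console.log(classify("www.domain.com/art/something/someotherthing/article(X123456.123)/"));
console.log(classify("www.domain.com/art/something/someotherthing"));
```

Because both patterns fix the number of path segments, a product URL can never satisfy the category pattern, so no lookaround is needed to tell them apart.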
From the question description and the URL data given...
Product URLs
matched by ^([^\/\r\n]+?)\/(art)\/(.*)\/.*?\(([xX]?[0-9]{6,7}\.[0-9]+)\).*?\/?$
1st capture == domain
2nd capture == art (main category?)
3rd capture == category
4th capture == Product ID
Category URLs
matched by ^([^\/\r\n]+?)\/(art)\/((?!.*[xX]?[0-9]{6,7}\.[0-9]+).*?)\/?$
1st capture == domain
2nd capture == art (main category?)
3rd capture == category
I inferred that the trailing / was optional for both URLs, but that may be an incorrect assumption.
Each regex above was linked to a live regex101 fiddle containing the regex plus test data.
Note that the \r\n inside the character class for the domain match is only needed because the regex101 fiddle matches globally against the combined test data. You can remove it if you are only matching against a single URL at a time.
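The capture groups listed above can be read out as follows. This JavaScript sketch matches one URL at a time, so the \r\n has been dropped from the domain character class as suggested; the function name describeUrl and the property names are illustrative assumptions.

```javascript
const productUrl =
  /^([^\/]+?)\/(art)\/(.*)\/.*?\(([xX]?[0-9]{6,7}\.[0-9]+)\).*?\/?$/;
const categoryUrl =
  /^([^\/]+?)\/(art)\/((?!.*[xX]?[0-9]{6,7}\.[0-9]+).*?)\/?$/;

// Returns the captured pieces, or null when neither pattern matches.
function describeUrl(url) {
  let m = url.match(productUrl);
  if (m) {
    return { type: "product", domain: m[1], section: m[2], category: m[3], productId: m[4] };
  }
  m = url.match(categoryUrl);
  if (m) {
    return { type: "category", domain: m[1], section: m[2], category: m[3] };
  }
  return null;
}

console.log(describeUrl("www.domain.com/art/something/someotherthing/article(X123456.123)/"));
console.log(describeUrl("www.domain.com/art/something/someotherthing"));
```

The product pattern is tried first; the negative lookahead in the category pattern then rejects any URL that still contains a product ID.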