regex to select only first instance of string (no duplicates) - regex

I am using this regex:
/(rs)\w+/
to select strings that begin with 'rs', e.g. in
..the biomarker rs4343 but not rs4342. However rs4343 ..
This returns: rs4343, rs4342, rs4343
Is it possible to use regex to select only the first instance of each matched string to avoid duplication, i.e. to return: rs4343, rs4342?
I can use JS or PHP regex.

Try this:
(rs\w+)(?!.*\1)
Regex101
Details:
(rs\w+) - Group the required match
(?!.*\1) - Negative lookahead asserting that the same text does not occur again later in the string, so each distinct value is matched once (at its last occurrence)
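A minimal Python sketch of the pattern above, assuming the sample sentence from the question as input (the variable names are illustrative). Because the lookahead rejects a match whenever the same text occurs again later, each value is reported at its last occurrence. If first-occurrence order matters, one alternative that is not part of the answer is to match everything and de-duplicate outside the regex, shown here with dict.fromkeys.

```python
import re

text = "..the biomarker rs4343 but not rs4342. However rs4343 .."

# Pattern from the answer: keep a match only if the same text does not recur later.
last_occurrences = re.findall(r"(rs\w+)(?!.*\1)", text)

# Alternative (not from the answer): match everything, then drop repeats
# while preserving first-seen order.
first_occurrences = list(dict.fromkeys(re.findall(r"\brs\w+", text)))

print(last_occurrences)   # ['rs4342', 'rs4343']
print(first_occurrences)  # ['rs4343', 'rs4342']
```

Note that . does not match a newline by default, so for multi-line input the lookahead would need the DOTALL flag (or the s modifier in PHP/JS).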

Related

Regex to pull last 2 segments from FQDN

I'm trying to work out a regex that pulls out the last 2 segments of an FQDN.
^.*\shostname=[\w-]+\.(?P<myfield>[^\t]+)
This RegEx works and takes out the first segment of an FQDN.
www.aaa.bbb.someurl.net --> aaa.bbb.someurl.net
But… I only want to keep the last 2 segments of any FQDN.
I need it to be --> someurl.net
Other restrictions:
The hostname field will always be at least 3 segments - don't know the max.
This is for Splunk so I can't use a script. I need it to be PCRE compatible regex.
Here is an example of data:
2021-07-20 18:19:14 reason=Not allowed to browse this category event_id=12345 protocol=HTTP action=Blocked transactionsize=16051 responsesize=789 requestsize=456 urlcategory=Blocked serverip=1.2.4.5 clienttranstime=0 requestmethod=GET refererURL=None useragent=Microsoft-Delivery location=Internal ClientIP=5.6.7.8 status=403 user=John url=dl.delivery.mp.microsoft.com/filestreamingservice/files/abcd-efgh-ijkl/pieceshash vendor=Zscaler hostname=dl.delivery.mp.microsoft.com
From this data I need the field “myfield” to be: microsoft.com.
The original answer with a much simpler regex ((?:\s|^)hostname=(?:[^\s.]+\.)*(?P<myfield>[^\s.]+\.[^\s.]+)) that worked for OP is in the question history.
You can use
(?:\s|^)hostname=(?:[^\s.]+\.)*?(?P<myfield>[^\s.]+\.(?:(?:ac|co)\.uk|govt?\.uk|judiciary\.uk|l(?:ea|td)\.uk|m(?:e|il|od)\.uk|n(?:et|hs|ic)\.uk|orgn?\.uk|p(?:arliament|lc|olice)\.uk|(?:royal|sch)\.uk|[^\s.]+)(?!\S))
Or, to match the last hostname=... value on a line:
^.*\shostname=(?:[^\s.]+\.)*?(?P<myfield>[^\s.]+\.(?:(?:ac|co)\.uk|govt?\.uk|judiciary\.uk|l(?:ea|td)\.uk|m(?:e|il|od)\.uk|n(?:et|hs|ic)\.uk|orgn?\.uk|p(?:arliament|lc|olice)\.uk|(?:royal|sch)\.uk|[^\s.]+)(?!\S))
See the regex #1 demo and regex #2 demo. Details:
(?:\s|^) - either a whitespace or start of string
hostname= - a literal substring
(?:[^\s.]+\.)*? - zero or more (but as few as possible) occurrences of one or more chars other than whitespace and dot and then a dot
(?P<myfield>[^\s.]+\.(?:(?:ac|co)\.uk|govt?\.uk|judiciary\.uk|l(?:ea|td)\.uk|m(?:e|il|od)\.uk|n(?:et|hs|ic)\.uk|orgn?\.uk|p(?:arliament|lc|olice)\.uk|(?:royal|sch)\.uk|[^\s.]+)(?!\S)) - Group "myfield": one or more chars other than whitespace and dot, then a dot, then any second-level domain or any one or more chars other than whitespace and dot and then either a whitespace or end of string.
Note: the \.(?:(?:ac|co)\.uk|govt?\.uk|judiciary\.uk|l(?:ea|td)\.uk|m(?:e|il|od)\.uk|n(?:et|hs|ic)\.uk|orgn?\.uk|p(?:arliament|lc|olice)\.uk|(?:royal|sch)\.uk pattern part (built from a regex trie) matches this list:
.ac.uk
.co.uk
.gov.uk
.judiciary.uk
.ltd.uk
.me.uk
.mod.uk
.net.uk
.nhs.uk
.nic.uk
.org.uk
.parliament.uk
.plc.uk
.police.uk
.royal.uk
.sch.uk
.govt.uk
.orgn.uk
.lea.uk
.mil.uk
If you want to add more second-level domain names, add them to the list and use https://www.myregextester.com or a similar service to build the word list regex.
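As a quick way to check the first pattern outside Splunk, here is a minimal Python sketch. Python's re module accepts the same (?P<myfield>...) named-group syntax; the function name extract_myfield and the shortened log line are assumptions made for illustration only.

```python
import re

# Regex #1 from the answer, split across lines only for readability.
HOSTNAME_RE = re.compile(
    r"(?:\s|^)hostname=(?:[^\s.]+\.)*?"
    r"(?P<myfield>[^\s.]+\."
    r"(?:(?:ac|co)\.uk|govt?\.uk|judiciary\.uk|l(?:ea|td)\.uk|m(?:e|il|od)\.uk"
    r"|n(?:et|hs|ic)\.uk|orgn?\.uk|p(?:arliament|lc|olice)\.uk|(?:royal|sch)\.uk"
    r"|[^\s.]+)(?!\S))"
)

def extract_myfield(line):
    """Return the captured domain, or None if no hostname= field is present."""
    match = HOSTNAME_RE.search(line)
    return match.group("myfield") if match else None

# Shortened version of the sample event from the question (hypothetical trimming).
line = "status=403 user=John vendor=Zscaler hostname=dl.delivery.mp.microsoft.com"
print(extract_myfield(line))                         # microsoft.com
print(extract_myfield("hostname=www.example.co.uk")) # example.co.uk
```

The lazy (?:[^\s.]+\.)*? prefix is what lets the .co.uk style alternatives win: the engine tries the shortest prefix first and only grows it when the rest of the pattern cannot reach the end of the value.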
You could match all following non-whitespace chars after hostname= and then use a capture group to capture the last part with a single dot.
^.*\shostname=(?:\S+\.)?([^\s.]+\.[^\s.]+)
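^.*\shostname= Match as many chars as possible, then a whitespace char and hostname= (so the last hostname= on the line is used)
(?:\S+\.)? Optionally match one or more non-whitespace chars followed by a dot (the leading segments)
( Capture group 1
[^\s.]+\.[^\s.]+ Match 2 non dot parts with a . in between
) Close group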
Regex demo
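The same pattern can be tried in Python as a rough check; this is only a sketch, and the shortened log line and variable names are assumptions. The greedy \S+ first swallows the whole hostname and then backtracks dot by dot until two dot-separated parts are left for group 1.

```python
import re

LAST_TWO_RE = re.compile(r"^.*\shostname=(?:\S+\.)?([^\s.]+\.[^\s.]+)")

def last_two_segments(line):
    """Return capture group 1 (the last two segments), or None without a match."""
    match = LAST_TWO_RE.search(line)
    return match.group(1) if match else None

print(last_two_segments("user=John vendor=Zscaler hostname=dl.delivery.mp.microsoft.com"))
# microsoft.com
```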
If you would like to account for country codes, I've previously answered this at: Get Domain Extension From Hostname
The regular expression would look something like (shortened version): \w+((\.[a-z]{2,3})(\.(uk|au))?)$
The full expression with all country codes: \w+((\.[a-z]{2,3})(\.(ad|ae|af|ag|ai|al|am|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bl|bm|bn|bo|bq|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cu|cv|cw|cx|cy|cz|de|dj|dk|dm|do|dz|ec|ee|eg|er|es|et|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mf|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|ss|st|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|za|zm|zw))?)$
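A small Python sketch of the shortened version, applied to a bare hostname rather than a full log line (the function name and the second test hostname are assumptions for illustration). The whole match, group 0, is the domain plus its extension.

```python
import re

# Shortened pattern from the answer: only uk and au are treated as country codes.
SUFFIX_RE = re.compile(r"\w+((\.[a-z]{2,3})(\.(uk|au))?)$")

def domain_with_suffix(host):
    """Return the trailing domain including its extension, or None."""
    match = SUFFIX_RE.search(host)
    return match.group(0) if match else None

print(domain_with_suffix("dl.delivery.mp.microsoft.com"))  # microsoft.com
print(domain_with_suffix("www.example.co.uk"))             # example.co.uk
```

One caveat: \w does not match a hyphen, so a label such as my-site would be cut down to site by this pattern.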

regex match string unless followed by #

I'm trying to add a #param to URLs. I want to add it to all URLs that don't already have the # to avoid a double param. My URLs don't look like URLs; they are made up of handlebar parameters.
They can look like the following:
{{app.url}}
{{root.app.url}}
{{app.url}}#param
{{root.app.url}}#param
So I came up with a regex that matches the handlebar tag ({{(root.)?app.url}})
The only problem is that when I later use regexp_replace(url, '({{(root\.)?app\.url}})', '\1#param')
my result looks like this:
{{app.url}}#param
{{root.app.url}}#param
{{app.url}}#param#param
{{root.app.url}}#param#param
One solution I can think of is doing it in two steps, and the 2nd step should look for duplicate #param#param and replace that with single #param.
But it had me wondering if there was a way using regex to exclude the handlebar tags that are followed by # and completely cancel that match?
Here are some examples:
https://regex101.com/r/d3Zyvo/6
Note: this is for use in PostgreSQL update queries. The regex is POSIX/PCRE. I must use regexp_replace with a back reference since there might be content before and after the handlebar tags, so I cannot simply concatenate the param (see the link).
You may use a negative lookahead (?!#):
({{(root\.)?app\.url}})(?!#)
                       ^^^^^
See the regex demo.
Details
({{(root\.)?app\.url}}) - Group 1 (later referred to with \1 from the replacement pattern):
{{ - {{ substring
(root\.)? - an optional Group 2 matching 1 or 0 occurrences of root.
app\.url}} - a literal app.url}} substring
(?!#) - a negative lookahead that fails the match if, immediately to the right of the current location, there is a # char.
See Table 9-17. Regular Expression Constraints:
(?!re) negative lookahead matches at any point where no substring matching re begins (AREs only)
PostgreSQL demo:
select regexp_replace('{{app.url}}
{{root.app.url}}
{{app.url}}#param
{{root.app.url}}#param',
'({{(root\.)?app\.url}})(?!#)',
'\1#param',
'g');
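For readers who want to try the same idea outside the database, here is a rough Python equivalent using re.sub. It is a sketch, not the PostgreSQL behaviour itself: the braces are escaped here to keep the pattern unambiguous in Python, and the variable names are illustrative.

```python
import re

# Same pattern as above, with the literal braces escaped for Python's re module.
TAG_WITHOUT_PARAM = re.compile(r"(\{\{(root\.)?app\.url\}\})(?!#)")

urls = "{{app.url}}\n{{root.app.url}}\n{{app.url}}#param\n{{root.app.url}}#param"

# re.sub replaces every match by default, like the 'g' flag in regexp_replace.
result = TAG_WITHOUT_PARAM.sub(r"\1#param", urls)
print(result)
```

Tags already followed by # are skipped by the lookahead, so every line ends up with exactly one #param.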

Regex expression to extract strings

I have the following expression:
MN=Abc123,MN=sssa,MN=abc adsa 1,MN=&3ams d'amé,MN=dat,CB=ds,CB=ds
How can I extract one by one the expressions following MN= ?
E.g. first I want to extract Abc123, then sssa, and so on ...
Appreciate your answer!
Use a capture group:
"[A-Z]{2}=([^,]+)"
Then get the first group from your match object.
Or, if the language you are working with supports look-arounds, you can use a positive look-behind to match the expected parts directly:
"(?<=[A-Z]{2}=)[^,]+"
If your regex environment supports lookbehind, you can extract desired information with this regex:
Environment supporting Lookbehind
(?<=MN=)(.*?)(?=,)
Environment not supporting Lookbehind
(?:MN=)(.*?)(?=,)
Your desired result will be stored in Group 1, aka $1
Based on your input string, here is the result:
Abc123
sssa
abc adsa 1
&3ams d'amé
dat
See live demo here
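A short Python sketch of both approaches on the string from the question (variable names are illustrative). It combines the MN= lookbehind from the second answer with the [^,]+ body from the first, and also shows that the generic [A-Z]{2}= capture-group pattern picks up the CB= values too.

```python
import re

s = "MN=Abc123,MN=sssa,MN=abc adsa 1,MN=&3ams d'amé,MN=dat,CB=ds,CB=ds"

# Lookbehind restricted to MN=: the match itself is the value.
mn_values = [m.group(0) for m in re.finditer(r"(?<=MN=)[^,]+", s)]

# Capture-group pattern from the first answer: any two-letter key, value in group 1.
all_values = re.findall(r"[A-Z]{2}=([^,]+)", s)

# Values can be consumed one by one by iterating over the matches.
for value in mn_values:
    print(value)
```

Unlike (.*?)(?=,), the [^,]+ body does not need a trailing comma, so it also works when the wanted value is the last item in the string.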

Regex: Negative lookahead after list match

Consider the following input string (part of css file):
url('...');
url(example.png);
The objective is to take the url part using regex and do something with it. So the first part is easy:
url\(['"]?(.+?)['"]?\)
Basically, it takes contents from inside url(...) with optional quotes symbols. Using this regexp I get the following matches:
...
example.png
So far so good. Now I want to exclude the urls which include 'data:image' in their text. I think negative lookahead is the proper tool for that but using it like this:
url\(['"]?(?!data:image)(.+?)['"]?\)
gives me the following result for the first url:
'...
Not only does it fail to exclude this match, but the matched string itself now includes the quote character at the beginning. If I use + instead of the first ? like this:
url\(['"]+(?!data:image)(.+?)['"]?\)
it works as expected, url is not matched. But this doesn't allow the optional quote in url (since + is 1 or more). How should I change the regex to exclude given url?
You can use negative lookahead like this:
url\((['"]?)((?:(?!data:image).)+?)\1?\)
RegEx Demo
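A minimal Python sketch of this pattern. The CSS snippet, including the data URI, is a made-up example, since the question elides the real URL. The lookahead is checked before every character, so 'data:image' is rejected wherever it starts, even if the optional quote backtracks to empty. Note that the URL now lands in group 2, because group 1 holds the opening quote.

```python
import re

URL_RE = re.compile(r"""url\((['"]?)((?:(?!data:image).)+?)\1?\)""")

# Hypothetical stylesheet fragment for illustration.
css = (
    "a { background: url('data:image/png;base64,AAAA'); }\n"
    "b { background: url(example.png); }\n"
    'c { background: url("a.gif"); }'
)

# Group 2 is the URL without the surrounding quotes.
urls = [m.group(2) for m in URL_RE.finditer(css)]
print(urls)  # ['example.png', 'a.gif']
```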

Regex to match a URL and insert a directory

I would like to use regex to match the following:
http://www.test.com/example/sometext/
and then redirect to:
http://www.test.com/uk/example/sometext/
where 'example' is not in a list of reserved words, like _images, _lib, _css, etc.
Use a negative look ahead:
(http://www\.test\.com/)((?!(_images|_lib|_css))[^/]+/sometext/)
And replace with
$1uk/$2
Broken down, the juicy bits are:
(?!someregex) = a negative lookahead - ie assert the following input does not match someregex
(_images|_lib|_css) = the syntax for regex OR logic, just using literals
[^/]+ = some characters that aren't a slash
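A small Python sketch of the rewrite, under a few assumptions: the dots in the host name are escaped so they match literally, Python's \1 and \2 stand in for $1 and $2, and the function name add_uk is hypothetical.

```python
import re

REWRITE_RE = re.compile(
    r"(http://www\.test\.com/)((?!(_images|_lib|_css))[^/]+/sometext/)"
)

def add_uk(url):
    """Insert uk/ after the host unless the first path segment is reserved."""
    return REWRITE_RE.sub(r"\1uk/\2", url)

print(add_uk("http://www.test.com/example/sometext/"))
# http://www.test.com/uk/example/sometext/
print(add_uk("http://www.test.com/_images/sometext/"))
# http://www.test.com/_images/sometext/
```

Because the lookahead only checks how the segment starts, a directory such as _images2 would also be treated as reserved.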