Regex to pull last 2 segments from FQDN - regex

Working on trying to figure out some regex to pull out the last 2 segments of an FQDN.
^.*\shostname=[\w-]+\.(?P<myfield>[^\t]+)
This RegEx works and takes out the first segment of an FQDN.
www.aaa.bbb.someurl.net --> aaa.bbb.someurl.net
But… I only want to keep the last 2 segments of any FQDN.
I need it to be --> someurl.net
Other restrictions:
The hostname field will always be at least 3 segments - don't know the max.
This is for Splunk so I can't use a script. I need it to be PCRE compatible regex.
Here is an example of data:
2021-07-20 18:19:14 reason=Not allowed to browse this category event_id=12345 protocol=HTTP action=Blocked transactionsize=16051 responsesize=789 requestsize=456 urlcategory=Blocked serverip=1.2.4.5 clienttranstime=0 requestmethod=GET refererURL=None useragent=Microsoft-Delivery location=Internal ClientIP=5.6.7.8 status=403 user=John url=dl.delivery.mp.microsoft.com/filestreamingservice/files/abcd-efgh-ijkl/pieceshash vendor=Zscaler hostname=dl.delivery.mp.microsoft.com
From this data I need the field “myfield” to be: microsoft.com.

The original answer with a much simpler regex ((?:\s|^)hostname=(?:[^\s.]+\.)*(?P<myfield>[^\s.]+\.[^\s.]+)) that worked for OP is in the question history.
You can use
(?:\s|^)hostname=(?:[^\s.]+\.)*?(?P<myfield>[^\s.]+\.(?:(?:ac|co)\.uk|govt?\.uk|judiciary\.uk|l(?:ea|td)\.uk|m(?:e|il|od)\.uk|n(?:et|hs|ic)\.uk|orgn?\.uk|p(?:arliament|lc|olice)\.uk|(?:royal|sch)\.uk|[^\s.]+)(?!\S))
Or, to match the last hostname=... value on a line:
^.*\shostname=(?:[^\s.]+\.)*?(?P<myfield>[^\s.]+\.(?:(?:ac|co)\.uk|govt?\.uk|judiciary\.uk|l(?:ea|td)\.uk|m(?:e|il|od)\.uk|n(?:et|hs|ic)\.uk|orgn?\.uk|p(?:arliament|lc|olice)\.uk|(?:royal|sch)\.uk|[^\s.]+)(?!\S))
See the regex #1 demo and regex #2 demo. Details:
(?:\s|^) - either a whitespace or start of string
hostname= - a literal substring
(?:[^\s.]+\.)*? - zero or more (but as few as possible) occurrences of one or more chars other than whitespace and dot and then a dot
(?P<myfield>[^\s.]+\.(?:(?:ac|co)\.uk|govt?\.uk|judiciary\.uk|l(?:ea|td)\.uk|m(?:e|il|od)\.uk|n(?:et|hs|ic)\.uk|orgn?\.uk|p(?:arliament|lc|olice)\.uk|(?:royal|sch)\.uk|[^\s.]+)(?!\S)) - Group "myfield": one or more chars other than whitespace and dot, then a dot, then any second-level domain or any one or more chars other than whitespace and dot and then either a whitespace or end of string.
Note: the \.(?:(?:ac|co)\.uk|govt?\.uk|judiciary\.uk|l(?:ea|td)\.uk|m(?:e|il|od)\.uk|n(?:et|hs|ic)\.uk|orgn?\.uk|p(?:arliament|lc|olice)\.uk|(?:royal|sch)\.uk pattern part (built from a regex trie) matches this list:
.ac.uk
.co.uk
.gov.uk
.judiciary.uk
.ltd.uk
.me.uk
.mod.uk
.net.uk
.nhs.uk
.nic.uk
.org.uk
.parliament.uk
.plc.uk
.police.uk
.royal.uk
.sch.uk
.govt.uk
.orgn.uk
.lea.uk
.mil.uk
If you want to add more second-level domain names, add them to the list and use https://www.myregextester.com or a similar service to build the word list regex.
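As a rough illustration of extending the list, the alternation can also be generated from a plain list of suffixes instead of a hand-built trie. The sketch below uses Python's re module as a stand-in for Splunk's PCRE engine (an assumption: the named-group syntax is the same here, but this has not been run in Splunk). SECOND_LEVEL_SUFFIXES and extract_domain are hypothetical names, and the suffix list is deliberately abbreviated.

```python
import re

# Abbreviated, illustrative list; extend it with the suffixes you care about.
SECOND_LEVEL_SUFFIXES = ["ac.uk", "co.uk", "gov.uk", "org.uk", "nhs.uk"]

# Plain alternation of escaped suffixes (not an optimized trie).
_suffix_alt = "|".join(re.escape(s) for s in SECOND_LEVEL_SUFFIXES)

HOSTNAME_RE = re.compile(
    r"(?:\s|^)hostname=(?:[^\s.]+\.)*?"
    r"(?P<myfield>[^\s.]+\.(?:" + _suffix_alt + r"|[^\s.]+)(?!\S))"
)


def extract_domain(line):
    """Return the registered-domain part of the hostname= value, or None."""
    m = HOSTNAME_RE.search(line)
    return m.group("myfield") if m else None
```

A generated alternation is less compact than the trie version but behaves the same for matching purposes, and it is easier to keep in sync with a list.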

You could match all following non-whitespace chars after hostname= and then use a capture group to capture the last part with a single dot.
^.*\shostname=(?:\S+\.)?([^\s.]+\.[^\s.]+)
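^.*\shostname= Match as much of the line as possible, then a whitespace char and hostname= (so the last occurrence on the line is used)
(?:\S+\.)? Optionally match the leading segments: non-whitespace chars followed by a dot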
( Capture group 1
[^\s.]+\.[^\s.]+ Match 2 non dot parts with a . in between
) Close group
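A quick way to see what this shorter pattern captures is to run it in Python (used here only as a convenient PCRE-like engine; last_two_segments is a hypothetical helper name). Note that it always keeps exactly two segments, so a name under a two-part suffix such as co.uk is reduced to the suffix itself.

```python
import re

LAST_TWO_RE = re.compile(r"^.*\shostname=(?:\S+\.)?([^\s.]+\.[^\s.]+)")


def last_two_segments(line):
    """Return the last two dot-separated segments of the hostname= value."""
    m = LAST_TWO_RE.search(line)
    return m.group(1) if m else None


print(last_two_segments("vendor=Zscaler hostname=dl.delivery.mp.microsoft.com"))  # microsoft.com
print(last_two_segments("vendor=Zscaler hostname=www.bbc.co.uk"))  # co.uk
```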

If you would like to account for country codes, I've previously answered this at: Get Domain Extension From Hostname
The regular expression would look something like (shortened version): \w+((\.[a-z]{2,3})(\.(uk|au))?)$
The full expression with all country codes: \w+((\.[a-z]{2,3})(\.(ad|ae|af|ag|ai|al|am|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bl|bm|bn|bo|bq|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cu|cv|cw|cx|cy|cz|de|dj|dk|dm|do|dz|ec|ee|eg|er|es|et|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mf|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|ss|st|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|za|zm|zw))?)$
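For a feel of how the shortened version behaves, here is a small Python sketch applied to a bare hostname rather than a full log line (that input shape is an assumption, as is the helper name registered_domain).

```python
import re

# Shortened pattern from above: only .uk and .au are treated as country codes.
DOMAIN_RE = re.compile(r"\w+((\.[a-z]{2,3})(\.(uk|au))?)$")


def registered_domain(hostname):
    """Return the trailing 'name + extension' part of a hostname, or None."""
    m = DOMAIN_RE.search(hostname)
    return m.group(0) if m else None
```

Group 1 of the match holds just the extension (for example .co.uk), while the whole match includes the label in front of it.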

Related

Using regex to extract a string including a space between other strings and whitespace

I have the following output from my wireless controller and also the regex statement below. I am trying to parse out the various values using regex named capturing groups. The space in the 'Global/whatever Lab/Lab01' value is throwing off everything after that value. Is there a way to replace the \S+ in that group so it captures the whole value of 'Global/whatever Lab/Lab01'? Thank you.
Number of APs: 2\nAP Name Slots AP Model Ethernet MAC Radio MAC Location Country IP Address State \nAPAC4A.56BE.18A0 2 9120AXI ac4a.56be.18a0 045f.b91a.0a40 Global/whatever Lab/Lab01 US 2.2.2.2 Registered \nAPHAV-LAB-TEST-01 2 9120AXI ac4a.56be.8cd4 045f.b91d.4ce0 default location US 1.1.1.1 Registered
(?P<ap_name>\S+)\s+(?P<slots>\d+)\s+(?P<model_number>\S+)\s+(?P<ether_mac>\S+)\s+(?P<radio_mac>\S+)\s+(?P<location>\S+)\s(?P<country>\S+)\s+(?P<ip_address>\S+)?\s+(?P<state>\S+)
When you need to match a multi-word field value, make sure you can describe the format of the field(s) next to it. Once you know the rules, you can match the "unknown" field with a mere .*? pattern.
See an example solution:
(?P<ap_name>\S+)\s+(?P<slots>\d+)\s+(?P<model_number>\S+)\s+(?P<ether_mac>\S+)\s+(?P<radio_mac>\S+)\s+(?P<location>.*?)\s+(?P<country>[A-Z]{2,})(?:\s+(?P<ip_address>\d{1,3}(?:\.\d{1,3}){3}))?\s+(?P<state>\S+)
See the regex demo.
Now, the location group pattern is (?P<location>.*?) and it matches any char, 0 or more occurrences but as few times as possible, other than line break chars, and it is possible here since the next group pattern, country group, is now (?P<country>[A-Z]{2,}) and matches any substring of two or more uppercase ASCII letters.
Note I also "spelled out" the ip_address group pattern and made the whole part with initial whitespaces optional, (?:\s+(?P<ip_address>\d{1,3}(?:\.\d{1,3}){3}))?.
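A minimal Python sketch of this pattern applied to the two data rows from the question. It assumes each row is fed to the regex on its own and that the header lines have already been dropped (run against the whole output, the pattern can also latch onto the header text); parse_row is a hypothetical helper name.

```python
import re

ROW_RE = re.compile(
    r"(?P<ap_name>\S+)\s+(?P<slots>\d+)\s+(?P<model_number>\S+)\s+"
    r"(?P<ether_mac>\S+)\s+(?P<radio_mac>\S+)\s+(?P<location>.*?)\s+"
    r"(?P<country>[A-Z]{2,})(?:\s+(?P<ip_address>\d{1,3}(?:\.\d{1,3}){3}))?"
    r"\s+(?P<state>\S+)"
)

rows = [
    "APAC4A.56BE.18A0 2 9120AXI ac4a.56be.18a0 045f.b91a.0a40 "
    "Global/whatever Lab/Lab01 US 2.2.2.2 Registered",
    "APHAV-LAB-TEST-01 2 9120AXI ac4a.56be.8cd4 045f.b91d.4ce0 "
    "default location US 1.1.1.1 Registered",
]


def parse_row(row):
    """Return the named groups for one data row as a dict, or None."""
    m = ROW_RE.match(row)
    return m.groupdict() if m else None


parsed = [parse_row(r) for r in rows]
for p in parsed:
    print(p["location"], "|", p["country"], "|", p["ip_address"])
```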
maybe try doing something similar to replacing \S+ with [\S ]+
I don't know if there's anything like an escape code you can use to represent the space between the []

Regex question- string must appear in a specific way or not at all

i need help with a REGEX expression (for analytics).
not sure how to handle the requirements.
Here's an example of a URL:
/a.html?ref=aa&project=11&utm=bb
This URL would have &project=XX in the middle but it is possible that &project won't be there at all..
Requirements:
I want the regex to be positive only for specific project=XX (for example only when XX equals 11 or 12 or 13) but negative for all other values (project=22).
The parameter before it (?ref in the example below) is mandatory
Any parameter afterwards (&utm) is optional
For example:
fine: /a.html?ref=aa&project=11&utm=bb
fine: /a.html?ref=aa&utm=bb
not fine: /a.html?ref=aa&project=22&utm=bb
How do I approach this?
I tried this it kinda works (but only without additional utm params):
\/a.html\?ref\=aa(\&project\=(11|12|13))?$
I tried this, but it doesn't work when using the utm parameter:
\/a.html\?ref\=aa(\&project\=(11|12|13))?(\&utm\=.*)?$
Thanks
Itay
You don't say what platform you're using, but you'll need to escape your forward slashes and question marks if you want them to match literal characters on most platforms:
\/a.html\?ref=aa(&project=(11|12|13))?(&utm=.*)?$
You might also want to minimize your capture in the utm block in case other things come after it that you don't want:
\/a.html\?ref=aa(&project=(11|12|13))?(&utm=.*?)?$
You could use character class [123] to match either 1,2 or 3 with a single optional group, and note to escape the dot to match it literally.
\/a\.html\?ref=aa(&project=1[123])?&utm=.*$
The pattern matches:
\/a\.html match /a.html
\?ref=aa Match ?ref=aa
( Capture group
&project=1[123] Match &project=1 and then either 1,2 or 3
)? Close the capture group and make it optional
&utm=.*$ Match &utm= followed by the rest of the line
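Since the question says the utm parameter is optional, a variant that combines the ideas from the answers above (escaped dot, 1[123] character class, optional utm block) can be checked against the three example URLs. This is only a sketch in Python; the platform was never stated, and is_fine is a hypothetical helper name.

```python
import re

# ref=aa is mandatory, project is optional but must be 11-13, utm is optional.
URL_RE = re.compile(r"/a\.html\?ref=aa(?:&project=1[123])?(?:&utm=.*)?$")


def is_fine(url):
    """True if the URL satisfies the project/utm rules from the question."""
    return URL_RE.search(url) is not None


for url in (
    "/a.html?ref=aa&project=11&utm=bb",
    "/a.html?ref=aa&utm=bb",
    "/a.html?ref=aa&project=22&utm=bb",
):
    print(url, is_fine(url))
```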

Regex - Skip characters to match

I'm having an issue with Regex.
I'm trying to match T0000001 (2, 3 and so on).
However, some of the lines it searches has what I can describe as positioners. These are shown as a question mark, followed by 2 digits, such as ?21.
These positioners describe a new position if the document were to be printed off the website.
Example:
T123?214567
T?211234567
I need to disregard ?21 and match T1234567.
From what I can see, this is not possible.
I have looked everywhere and tried numerous attempts.
All we have to work from is the linked image. The creators can't even confirm which flavour of regex it is; they believe it's Python, but I'm unsure.
Regex Image
Update
Unfortunately, none of the patterns below have worked so far. I tested each one in the live system (rather than in a regex tester, thinking it might behave differently), but they still didn't work.
There is no replace feature, and as mentioned before I'm not sure if it is Python. Appreciate your help.
Do two regex operations
First do the regex replace to replace the positioners with an empty string.
(\?[0-9]{2})
Then do the regex match
T[0-9]{7}
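The two operations can be sketched in Python as follows, assuming the tool really is Python-based and a replace step is available (the question leaves both open); find_codes is a hypothetical helper name.

```python
import re

POSITIONER_RE = re.compile(r"\?[0-9]{2}")  # e.g. ?21
CODE_RE = re.compile(r"T[0-9]{7}")


def find_codes(text):
    """Strip positioners first, then collect every T + 7 digits code."""
    cleaned = POSITIONER_RE.sub("", text)
    return CODE_RE.findall(cleaned)


print(find_codes("T123?214567\nT?211234567\nT0000001"))
```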
If there's only one occurrence of the 'positioners' in each match, something like this should work: (T.*?)\?\d{2}(.*)
This can be tested here: https://regex101.com/r/XhQXkh/2
Basically, match two capture groups before and after the '?21' sequence. You'll need to concatenate these two matches.
First, match the ?21 and replace it with a distinctive character such as #:
\?21
Then you may try this regex to find what you want:
(T(?:\d{7}|[\#\d]{8}))\s
Here the target string is captured in group 1 (\1).
Finally, replace # with ?21 or anything else you like.
A Python script might look like this:
import re

ss = """T123?214567
T?211234567
T1234567
T1234434?21
T5435433"""
rexpre = re.compile(r'\?21')
regx = re.compile(r'(T(?:\d{7}|[\#\d]{8}))\s')
# Print each match with the positioner replaced by '#'
for m in regx.findall(rexpre.sub('#', ss)):
    print(m)
print()
# Print the same matches with '#' restored to '?21'
for m in regx.findall(rexpre.sub('#', ss)):
    print(re.sub('#', r'?21', m))
Output is
T123#4567
T#1234567
T1234567
T1234434#
T123?214567
T?211234567
T1234567
T1234434?21
If using a replace functionality is an option for you then this might be an approach to match T0000001 or T123?214567:
Capture a T followed by zero or more digits before the optional part in group 1 (T\d*)
Make the question mark followed by 2 digits part optional (?:\?\d{2})?
Capture one or more digits after in group 2 (\d+).
Then in the replacement you could use group1group2 \1\2.
Using word boundaries \b (Or use assertions for the start and the end of the line ^ $) this could look like:
\b(T\d*)(?:\?\d{2})?(\d+)\b
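A short sketch of this approach with Python's re.sub, assuming a replace step is available; strip_positioner is a hypothetical helper name.

```python
import re

CODE_WITH_POSITIONER_RE = re.compile(r"\b(T\d*)(?:\?\d{2})?(\d+)\b")


def strip_positioner(text):
    """Rewrite each code as group 1 + group 2, dropping the ?NN part."""
    return CODE_WITH_POSITIONER_RE.sub(r"\1\2", text)


for sample in ("T123?214567", "T?211234567", "T0000001"):
    print(sample, "->", strip_positioner(sample))
```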
Is the below what you want?
Use RegExReplace with multiline tag (m) and enable replace all occurrences!
Pattern = (T\d*)\?\d{2}(\d*)
replace = $1$2

Regex to match only urls that contains a certain path

I am using the following regex
(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?
and it's showing me a url but I want to show only URLS that contain
/video/hd/
The following correction of the Regex above did not deal correctly with slashes
((?:https\:\/\/)|(?:http\:\/\/)|(?:www\.))?([a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(?:\??)[a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~]+)
You said only the whole match is used, and the regex contains no backreferences. Therefore we can replace all capturing groups (( )) in the regex by non-capturing groups ((?: )). A few of the groups are redundant, and http|https can be simplified to https?. Together this gives us
(?:https?|ftp)://[\w_-]+(?:\.[\w_-]+)+(?:[\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?
_ is not allowed in hostnames:
(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+(?:[\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?
Technically - cannot appear at the beginning or end of a hostname, but we'll ignore that. Your regex doesn't allow non-default ports or IPv6 hosts either, but we'll ignore that, too.
The stuff matched by the last part of your regex (which is presumably meant to match path, query string, and anchor all together) can overlap with the hostname (both \w and - are in both character classes). We can fix this by requiring a separator of either / or ? after the hostname:
(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+(?:[/?][\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?
Now we can start looking at your additional requirement: The URL should contain /video/hd/. Presumably this string should appear somewhere in the path. We can encode this as follows:
(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+/(?:[\w.,#^=%&:/~+-]*/)?video/hd/(?:[\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?
Instead of matching an optional separator of / or ?, we now always require a / after the hostname. This / must be followed by either video/hd/ directly or 0 or more path characters and another /, which is then followed by video/hd/. (The set of path characters does not include ? (which would start the query string) or # (which would start the anchor).)
As before, after /video/hd/ there can be a final part of more path components, a query string, and an anchor (all optional).
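To try the final pattern out, here is a small Python sketch. The regex is copied verbatim from the answer above; the sample URLs and the helper name find_video_urls are made up for illustration.

```python
import re

VIDEO_HD_URL_RE = re.compile(
    r"(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+/(?:[\w.,#^=%&:/~+-]*/)?video/hd/"
    r"(?:[\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?"
)


def find_video_urls(text):
    """Return every URL in text whose path contains /video/hd/."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return VIDEO_HD_URL_RE.findall(text)


sample = (
    "see https://example.com/media/video/hd/clip.mp4 "
    "and http://example.com/video/sd/clip.mp4"
)
print(find_video_urls(sample))
```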
First of all, you need a regex to match URLs (be they http, https...)
(([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+#)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+#)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%#.\w_]*)#?(?:[\w]*))
Once you have that, you need to select them without "consuming" them. You can do this with a lookahead, i.e. a regex that asserts that what follows the current position is, e.g., foo:
(?=foo)
Of course you will replace foo with the first regex I wrote.
At this point, you know you selected a URL; now you just constrain your search to URLs that contain /video/hd:
.*\/video\/hd\/.*
So the complete regex is
(?=(([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+#)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+#)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%#.\w_]*)#?(?:[\w]*))).*\/video\/hd\/.*
You can test it here with a live demo.

Regex - disallow combinations like "u12345"

I need to disallow combinations in this structure:
the string starts with a lowercase "u"
and is followed only by digits (up to 5 of them)
apart from this disallowed combination, allow only [a-zA-Z0-9]+
So far I only have a regex like ^[^u][^0-9][^0-9][^0-9][^0-9][^0-9]$, because I have no idea how to add the exception only for strings starting with "u".
List of some allowed combinations:
u12adfw3
u1a234
ud1235
And list of disallowed combinations:
u12345
u91
u1
I need this because system-generated names look like "u20", and aliases must not collide with them. I am building a system where a user can be identified by name, alias, or e-mail (just by looking that string up in the database), and because users are not required to set their own alias, I want to put some limits in place. The regex is meant for the "pattern" attribute of an HTML input tag or for a PHP check after submit.
If you know of any interesting tutorials or topics covering a similar problem, or you just want to help me, thank you in advance :)
Greetings
If you're checking that in PHP, you could use preg_match and check with this regex:
^(?!u\d{1,5}\b)
preg_match will return 0 (no match) if the string begins with a u and 1 to 5 digits that are not followed by another word character.
^ matches at the beginning of the string.
(?! ... ) is a negative lookahead. If what's inside matches, the whole regex will fail.
u\d{1,5} is to match u followed by 1 to 5 digits.
\b is a word boundary and will prevent any following word characters.
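The lookahead can be checked against the allowed and disallowed lists from the question. The sketch below uses Python rather than PHP, and it appends [a-zA-Z0-9]+ to cover the "allow only alphanumerics" rule, which is an addition to the regex shown above; is_valid_alias is a hypothetical helper name.

```python
import re

# re.ASCII keeps \d and \b limited to ASCII, closer to PCRE's default behaviour.
ALIAS_RE = re.compile(r"(?!u\d{1,5}\b)[a-zA-Z0-9]+", re.ASCII)


def is_valid_alias(alias):
    """True unless the alias is 'u' plus 1-5 digits or has non-alphanumerics."""
    return ALIAS_RE.fullmatch(alias) is not None


allowed = ["u12adfw3", "u1a234", "ud1235"]
disallowed = ["u12345", "u91", "u1"]
print([is_valid_alias(a) for a in allowed])
print([is_valid_alias(a) for a in disallowed])
```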