Extract application name from user agent

I am using the following regex to extract application name from user agents:
^([^\s/\[]+)([\s/\[]|\z)
The character class that terminates the application name consists of whitespace, a forward slash (/) and [.
Starting at the beginning of the string, the pattern reads every character that is not whitespace, / or [, and stops at the first whitespace, / or [ (or at the end of the string).
link : https://regex101.com/r/7ndDEq/1
It fails on application names that contain whitespace: it extracts only the characters before the first space.
For example, applying the above regex to:
Pump Log/1300 CFNetwork/1121.2.2 Darwin/19.3.0
It extracts Pump
but the ground truth is Pump Log

Unless I'm misreading your requirements, your application name is anything up to but not including the first slash, which would just be
^([^/]+)
Or depending on your regex engine (which you should always specify when asking regex questions), you could do this with PCRE:
^(.+?)/
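As a quick check of the first suggestion, here is a minimal sketch assuming Python's re module; the helper name app_name is made up for illustration and is not part of the question.

```python
import re

# Everything from the start of the string up to, but not including, the first slash.
FIRST_SLASH = re.compile(r"^([^/]+)")

def app_name(user_agent):
    """Return the text before the first '/', or None if the string is empty or starts with '/'."""
    m = FIRST_SLASH.match(user_agent)
    return m.group(1) if m else None

print(app_name("Pump Log/1300 CFNetwork/1121.2.2 Darwin/19.3.0"))  # Pump Log
```

Note that a user agent containing no slash at all is returned whole by this pattern.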

Try this:
^([^\s/[]+(?:\s[\w]+/)?)
It's almost there (the last slash should be removed in some matches).
The principle is simple: after capturing the required string, let the regex also capture the optional part (here, the second word after the first space) if it follows the main match. The ? at the end makes this second part optional.
UPD: this one is more general
^([^\s/[]+(?: [^/\d]+)?)
But there are two interesting points here:
I had to put a literal space in the regex because \s did not work there; I don't know how it will behave in your code.
A rule is needed for what may follow the whitespace, that is, where the second, optional part has to stop. If it is a slash or a bracket, that works fine, but in strings like Apple iPhone10,4 iOS v13.3.1 Main/3.2.0 or POF 12.51.1859; (iPhone8,4; iOS 13.3.1; en_US; g=ON; p=ON; r=WWAN) 56BA8A93-3748-4C5E-9D00-D811FCC4EBCE; it's hard to find where to stop...
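To see where the more general pattern stops on the strings mentioned above, here is a small sketch assuming Python's re module. The [ inside the character class is escaped here for readability, and the helper name extract is hypothetical.

```python
import re

# First token, optionally followed by a space and a run of characters
# that are neither '/' nor digits.
GENERAL = re.compile(r"^([^\s/\[]+(?: [^/\d]+)?)")

def extract(user_agent):
    """Return capture group 1, or None when the pattern does not match."""
    m = GENERAL.match(user_agent)
    return m.group(1) if m else None

samples = [
    "Pump Log/1300 CFNetwork/1121.2.2 Darwin/19.3.0",
    "Apple iPhone10,4 iOS v13.3.1 Main/3.2.0",
    "POF 12.51.1859; (iPhone8,4; iOS 13.3.1; en_US; g=ON; p=ON; r=WWAN)",
]
for s in samples:
    print(repr(extract(s)))
# 'Pump Log'
# 'Apple iPhone'
# 'POF'
```

The second result shows the stopping problem: the optional part ends at the first digit, which may or may not be the intended application name.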

You might specify the allowed characters in a character class or use an alternation |
You can extend those to allow more characters or allowed strings.
^([^\s/\[]+(?: (?:& )?[A-Z][a-z]*)*)(?:[\s/\[]|\Z)
^ Start of string
( Capture group 1
[^\s/\[]+ Match 1+ times any char except a whitespace char, / or [
(?: Open non capture group and match a space (or use \s+ to match 1+ whitespace chars, which could also match a newline)
(?:& )?[A-Z][a-z]* Optionally match & and match an uppercase char A-Z followed by optional lowercase chars a-z
)* Close non capture group and optionally repeat
) Close group 1
(?:[\s/\[]|\Z) Match either a whitespace char, / or [, or assert the end of the string
Regex demo
Note that as you selected Python on regex101, you can use \Z to assert the position at the end of the string.
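The pattern above can be tried out with a short sketch in Python's re module. The function name app_name and the sample "Barnes & Noble/1.0" are made up for illustration; only the Pump Log string comes from the question.

```python
import re

# First token, then zero or more " Word" or " & Word" parts, followed by
# whitespace, '/', '[' or the end of the string.
APP = re.compile(r"^([^\s/\[]+(?: (?:& )?[A-Z][a-z]*)*)(?:[\s/\[]|\Z)")

def app_name(user_agent):
    """Return capture group 1, or None when the pattern does not match."""
    m = APP.match(user_agent)
    return m.group(1) if m else None

print(app_name("Pump Log/1300 CFNetwork/1121.2.2 Darwin/19.3.0"))  # Pump Log
print(app_name("Mozilla/5.0 (X11)"))                               # Mozilla
print(app_name("Barnes & Noble/1.0"))                              # Barnes & Noble
print(app_name("CFNetwork"))                                       # CFNetwork
```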

Related

Regex Help required for User-Agent Matching

I have used an online regex learning site (regexr) and created something that works, but with my very limited experience of writing regexes I could do with some help/advice.
In IIS10 logs, there is a list for time, date... but I am only interested in the cs(User-Agent) field.
My Regex:
(scan\-\d+)(?:\w)+\.shadowserver\.org
which matches these:
scan-02.shadowserver.org
scan-15n.shadowserver.org
scan-42o.shadowserver.org
scan-42j.shadowserver.org
scan-42b.shadowserver.org
scan-47m.shadowserver.org
scan-47a.shadowserver.org
scan-47c.shadowserver.org
scan-42a.shadowserver.org
scan-42n.shadowserver.org
scan-42o.shadowserver.org
but what I would like it to do is:
Match a single number with the option of capturing more than one: scan-2 or scan-02 with an optional letter: scan-2j or scan-02f
Append the rest of the User Agent: .shadowserver.org to the regex.
I will then add it to an existing URL Rewrite rule (as a condition) to abort the request.
Any advice/help would be very much appreciated
Tried:
To write a regex for IIS10 to block requests from a certain user-agent
Expected:
It to work on single numbers as well as double/triple numbers with or without a letter.
(scan\-\d+)(?:\w)+\.shadowserver\.org
Input Text:
scan-2.shadowserver.org
scan-02.shadowserver.org
scan-2j.shadowserver.org
scan-02j.shadowserver.org
scan-17w.shadowserver.org
scan-101p.shadowserver.org
UPDATE:
I eventually came up with this:
scan\-[0-9]+[a-z]{0,1}\.shadowserver\.org

This is an explanation of your regex pattern; if you only want the solution, go directly to the end.
(scan\-\d+)(?:\w)+
(scan\-\d+) Group 1: matches the word scan followed by a literal -. You escaped the hyphen with a \, but outside a character class an unescaped - is also a literal hyphen, so you don't have to escape it here. The - is followed by \d+, which means one or more digits 0-9 (there must be at least one digit). The text matched inside the parentheses is saved in the first capturing group.
(?:\w)+ Non-capturing group: \w matches one word character and is equal to [A-Za-z0-9_]. The plus sign + after the group means match the whole group one or more times; since the group contains only \w, it matches one or more word characters. Note that the non-capturing group is redundant here and \w+ can be used directly.
Taking two examples:
The first example: scan-02.shadowserver.org
(scan\-\d+)(?:\w)+
scan matches the word scan in scan-02 and \- matches the hyphen after it, giving scan-. \d+ (one or more digits) first matches 02, so the value so far is scan-02. Then comes (?:\w)+, which must match at least one word character. It tries to match the period . and fails, because the period is not a word character. Is it over at this point? No: the regex engine backtracks to the previous \d+, which this time matches only the 0 in scan-02, so scan-0 is saved in the first capturing group, and (?:\w)+ then matches the 2. Why does the engine go back to \d+? Because both \d+ and (?:\w)+ carry a +, meaning at least one digit and at least one word character respectively, and the engine tries to satisfy both literally.
The second example: scan-2.shadowserver.org
(scan\-\d+)(?:\w)+
(scan\-\d+) matches scan-2, and (?:\w)+ tries to match the period after scan-2 but fails. This is the important point: \d+ has matched only one digit and cannot give any back, so the attempt at the start of the string fails. The engine then retries the whole pattern starting from the character c in scan-2.shadowserver.org, where the s in (scan\-\d+) fails to match c. It keeps trying at each later position, and in the end the match fails.
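The backtracking described in the two examples can be observed directly. This is a minimal sketch assuming Python's re module, using the pattern from the question unchanged; the variable names are illustrative.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r"(scan\-\d+)(?:\w)+\.shadowserver\.org")

# First example: \d+ gives back the "2" so that (?:\w)+ has something to match.
first = ORIGINAL.search("scan-02.shadowserver.org")
print(first.group(1))  # scan-0

# Second example: \d+ holds a single digit and cannot give it back, so no match.
second = ORIGINAL.search("scan-2.shadowserver.org")
print(second)  # None
```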
Simple solution:
(scan-\d+[a-z]?)\.shadowserver\.org
Explanation
(scan-\d+[a-z]?) Group 1: captures the word scan, followed by a literal -, followed by \d+ (one or more digits), followed by an optional lowercase letter [a-z]?. The ? makes the [a-z] part optional; without it, [a-z] would require exactly one lowercase letter.
See regex demo
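A quick way to confirm the simple solution against the input text from the question is sketched below, assuming Python's re module. IIS URL Rewrite uses its own regex engine, so treat this only as a sanity check of the pattern logic.

```python
import re

FIXED = re.compile(r"(scan-\d+[a-z]?)\.shadowserver\.org")

inputs = [
    "scan-2.shadowserver.org",
    "scan-02.shadowserver.org",
    "scan-2j.shadowserver.org",
    "scan-02j.shadowserver.org",
    "scan-17w.shadowserver.org",
    "scan-101p.shadowserver.org",
]

# Map each input to the content of group 1, or None if there is no match.
results = {}
for ua in inputs:
    m = FIXED.search(ua)
    results[ua] = m.group(1) if m else None

for ua, captured in results.items():
    print(ua, "->", captured)
```

Every input should map to its scan-... prefix, for example scan-101p for the last one.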

Regex for 5-7 characters, or 6-8 if including a space (no special characters allowed)

I am trying to create a regex for some basic postcode validation. It doesn't need to provide full validation (in my usage it's fine to miss out the space, for example), but it does need to check for the number of characters being used, and also make sure there are no special characters other than spaces.
This is what I have so far:
^[\s.]*([^\s.][\s.]*){5,7}$
This mostly works, but it has two flaws:
It allows for ANY character, rather than just alphanumeric characters + spaces
It allows for multiple spaces to be inserted:
I have tried updating it as follows:
^[\s.]*([a-zA-Z0-9\s.][\s.]*){5,7}$
This seems to have fixed the character issue, but still allows multiple spaces to be inserted. For example, this should be allowed:
AB14 4BA
But this shouldn't:
AB1 4 4BA
How can I modify the code to limit the number of spaces to a maximum of one (it's fine to have none at all)?

With your current set of rules you could say:
^(?:[A-Za-z0-9]{5,7}|(?=.{6,8}$)[A-Za-z0-9]+\s[A-Za-z0-9]+)$
See an online demo
^ - Start-line anchor;
(?: - Open non-capture group for alternations;
[A-Za-z0-9]{5,7} - Just match 5-7 alphanumeric chars;
| - Or;
(?=.{6,8}$) - Positive lookahead to assert that the position is followed by 6 to 8 characters until the end-line anchor;
[A-Za-z0-9]+\s[A-Za-z0-9]+ - Match 1+ alphanumeric chars on either side of the whitespace character;
)$ - Close non-capture group and match the end-line anchor.
Alternatively, maybe use a negative lookahead to prevent multiple spaces from occurring (or a space at the start):
^(?!\S*\s\S*\s|\s)(?:\s?[A-Za-z0-9]){5,7}$
See an online demo where I replaced \s with [^\S\n] for demonstration purposes. Also, though being the shorter expression, the latter will take more steps to evaluate the input.
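Both expressions can be compared side by side with a short sketch, assuming Python's re module. The names ALTERNATION, LOOKAHEAD and is_valid are made up for illustration.

```python
import re

# Variant 1: 5-7 alphanumerics, or 6-8 characters containing exactly one whitespace.
ALTERNATION = re.compile(
    r"^(?:[A-Za-z0-9]{5,7}|(?=.{6,8}$)[A-Za-z0-9]+\s[A-Za-z0-9]+)$"
)
# Variant 2: reject two whitespaces or a leading one, then count 5-7 alphanumerics.
LOOKAHEAD = re.compile(r"^(?!\S*\s\S*\s|\s)(?:\s?[A-Za-z0-9]){5,7}$")

def is_valid(pattern, postcode):
    """True if the whole postcode string satisfies the given pattern."""
    return pattern.match(postcode) is not None

for code in ["AB14 4BA", "AB144BA", "AB1 4 4BA", "AB1", "AB14-4BA"]:
    print(repr(code), is_valid(ALTERNATION, code), is_valid(LOOKAHEAD, code))
# 'AB14 4BA' True True
# 'AB144BA' True True
# 'AB1 4 4BA' False False
# 'AB1' False False
# 'AB14-4BA' False False
```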

Regex to find a line with two capture groups that match the same regex but are still different

I am trying to analyse my source code (written in C) for mismatched timer variable comparisons/assignments. I have a range of timers with different timebases (2-250 milliseconds). Every timer variable contains its granularity in milliseconds in its name (e.g. timer10ms), as does every timer-photo and define (e.g. fooTimer10ms, DOO_TIMEOUT_100MS).
Here are some example lines:
fooTimer10ms = timer10ms;
baaTimer20ms = timer10ms;
if (DIFF_100MS(dooTimer10ms) >= DOO_TIMEOUT_100MS)
if (DIFF_100MS(dooTimer10ms) < DOO_TIMEOUT_100MS)
I want to match those lines where the timebases do not correspond (in this case the second, third and fourth line). So far I have this regex:
(\d{1,3}(?i)ms(?-i)).*[^\d](\d{1,3}(?i)ms(?-i))
that is capable of finding every line where there are two of those granularities. So instead of just line 2, 3 and 4 it matches all of them. The only idea I had to narrow it down is to add a negative lookbehind with a back-reference, like so:
(\d{1,3}(?i)ms(?-i)).*[^\d](\d{1,3}(?i)ms(?-i))(?<!\1)
but this is not allowed because a negative lookbehind has to have a fixed length.
I found these two questions (one, two), but the first does not have the restriction that both capture groups are of the same kind, and the second is looking for equal instances of the capture group.
If what I want can be achieved far more easily by using something other than regex, I would be happy to know. My mind is just stuck on the belief that regex is capable of this and I am just not creative enough to use it properly.

One option is to match the timer part followed by the digits and use a negative lookahead with a backreference to assert that it does not occur at the right.
For the example data, a bit specific pattern using a range from 2-250 might be:
.*?(timer(?:2[0-4]\d|250|1?\d\d|[2-9])ms)\b\S*[^\S\r\n]*[<>]?=[^\S\r\n]*\b(?!\S*\1)\S+
The pattern matches
.*? Match any char except a newline, as few as possible (non greedy)
( Capture group 1
timer Match literally
(?:2[0-4]\d|250|1?\d\d|[2-9]) Match a number in the range of 2-250
ms Match literally
)\b Close group and a word boundary
\S*[^\S\r\n]* Match optional non whitespace chars and optional spaces without newlines
[<>]?= Match an optional < or > and =
[^\S\r\n]*\b Match optional whitespace chars without a newline and a word boundary
(?!\S*\1) Negative lookahead, assert no occurrence of what is captured in group 1 in the value
\S+ Match 1+ non whitespace chars
Regex demo
Or perhaps a broader pattern matching 1-3 digits and optional whitespace chars which might also match a newline:
.*?(timer\d{1,3}ms\b)\S*\s*[<>]?=\s*\b(?!.*\1)\S+
Regex demo
Note that {1-3} should be {1,3} and could also match 999
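Since the question also asks whether something other than a single regex would be easier, here is a sketch of a different technique, not the lookahead patterns above: collect every millisecond granularity on a line and flag the line when more than one distinct value appears. It assumes Python, and the names GRANULARITY and has_mismatch are hypothetical.

```python
import re

# 1-3 digits directly followed by "ms" (any case), not preceded by another digit.
GRANULARITY = re.compile(r"(?<!\d)(\d{1,3})ms", re.IGNORECASE)

def has_mismatch(line):
    """True if the line mentions at least two different millisecond granularities."""
    values = {int(v) for v in GRANULARITY.findall(line)}
    return len(values) > 1

lines = [
    "fooTimer10ms = timer10ms;",
    "baaTimer20ms = timer10ms;",
    "if (DIFF_100MS(dooTimer10ms) >= DOO_TIMEOUT_100MS)",
    "if (DIFF_100MS(dooTimer10ms) < DOO_TIMEOUT_100MS)",
]
for line in lines:
    if has_mismatch(line):
        print(line)
```

On the four example lines this prints the second, third and fourth, which is the result the question asks for. It treats any two different values on one line as a mismatch, so it is deliberately cruder than a pattern tied to a specific operator.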

How to exclude a word from regex subpattern?

I am using Delphi 7 and TDIPerlRegEx. I am looking for verbs in parts of sentence which contain some specific part to identify the verb.
s1 := '(I|you|he|she|it|we|they|this|that|these|those)';
s2 := '(can|should|would|could|must|want to|have to|had to|might)';
RegEx_Seek_1.MatchPattern := '(*UCP)(?m) \b'+s1+'\b \b'+s2+'\b \K([^ß\W]\w{2,15})\b';
The key word which is wrongly included in the result is "not"; it should be excluded:
Sample text:
... that you should not ßeat of every ...
Verb like this should be included in result:
Sample text:
lest he should put forth his hand ...
Now let me explain the ß sign. It marks that the original text had the word "not" followed by the verb. I changed the text in a previous pass or session, so the source text I am working with now is as shown above. The pattern ([^ß\W]\w{2,15}) should skip a word that is used in a negative sense, which is why the "negative" verb is not included.
So the point of the question is how to exclude the word "not" from the captured text, that is, the text captured by this pattern, which is either ([^ß\W]\w{2,15}) or (\W{3,15}) .
I am using this pattern to replace substrings in text.
More sample text needed?
than I can bear. And
so I might have taken her
they might dwell together
they could not ßdwell together
lest you should say,
In group 3 I expect matches for:
bear, taken (or possibly have instead of taken), dwell and say.
I am trying to exclude the word "not", so any verb or word following "not" must be excluded from the 3rd group or from the match completely. I am interested in group 3 only. Groups 1 and 2 just specify the alternatives preceding the verb.

You may use a branch reset group to match an empty string if there is "not" as a whole word after a modal verb, or a notional verb otherwise:
\b(I|you|he|she|it|we|they|this|that|these|those)\s+(can|should|would|could|must|want to|have to|had to|might)\s+\K(?|(?=not\b)()|([^ß\W]\w{2,15})\b)
See the regex demo
Details
\b - a word boundary
(I|you|he|she|it|we|they|this|that|these|those) - one of the pronouns in the group 1
\s+ - 1+ whitespaces (it is already acting as a word boundary on both sides of the adjacent groups)
(can|should|would|could|must|want to|have to|had to|might) - one of the modal verbs
\s+ - 1+ whitespaces
\K - match reset operator
(?|(?=not\b)()|([^ß\W]\w{2,15})\b) - the branch reset group matching either
(?=not\b)() - if there is "not" as a whole word immediately to the right, capture an empty string into Group 3
| - or (here, else)
([^ß\W]\w{2,15})\b - match and capture into Group 3 any word char other than ß and then 2 to 15 word chars with a word boundary to follow.
Note that (?m) - PCRE_MULTILINE - is only necessary if you want your ^ and $ outside of character classes match start and end of lines rather than the whole string. Since your pattern has no such anchors, (?m) is redundant.
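For readers who want to experiment outside PCRE: to my knowledge Python's built-in re module supports neither \K nor the branch reset group, so the sketch below swaps in a plain negative lookahead (?!not\b). With this substitution a line containing "not" produces no match at all, rather than an empty Group 3. The helper name verb_after_modal is made up.

```python
import re

PRONOUNS = r"(I|you|he|she|it|we|they|this|that|these|those)"
MODALS = r"(can|should|would|could|must|want to|have to|had to|might)"

# Negative lookahead instead of \K plus branch reset: skip the match when
# the word right after the modal verb is "not".
VERB = re.compile(
    r"\b" + PRONOUNS + r"\s+" + MODALS + r"\s+(?!not\b)([^ß\W]\w{2,15})\b"
)

def verb_after_modal(text):
    """Return the word captured in group 3, or None if nothing qualifies."""
    m = VERB.search(text)
    return m.group(3) if m else None

for sample in [
    "than I can bear. And",
    "so I might have taken her",
    "they might dwell together",
    "they could not ßdwell together",
    "lest you should say,",
]:
    print(repr(sample), "->", verb_after_modal(sample))
```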

Regex lookahead/lookbehind match for SQL script

I'm trying to analyse some SQLCMD scripts for code quality tests. I have a regex not working as expected:
^(\s*)USE (\[?)(?<![master|\$])(.)+(\]?)
I'm trying to match:
Strings that start with USE (ignore whitespace)
Followed by optional square bracket
Followed by 1 or more non-whitespace characters.
EXCEPT where that text is "master" (case insensitive)
OR EXCEPT where that text is a $ symbol
Expected results:
USE [master] - don't match
USE [$(CompiledDatabaseName)] - don't match
USE [anything_else.01234] - match
Also, the same patterns above without the [ and ] characters.
I'm using Sublime Text 2 as my RegEx search tool and referencing this cheatsheet

Your pattern - ^(\s*)USE (\[?)(?<![master|\$])(.)+(\]?) - uses a lookbehind that is variable-width (its length is not known beforehand) if you fix the character class issue inside it (i.e. replace [...] with (...) as you mean an alternative list of $ or a character sequence master) and thus is invalid in a Boost regex. Your (.)+ capturing is wrong since this group will only contain one last character captured (you could use (.+)), but this also matches spaces (while you need 1 or more non-whitespace characters). ? is the one or zero times quantifier, but you say you might have 2 opening and closing brackets (so, you need a limiting quantifier {0,2}).
You can use
^\h*USE(?!\h*\[{0,2}[^]\s]*(?:\$|(?i:master)))\h*\[{0,2}[^]\s]*]{0,2}
See regex demo
Explanation:
^ - start of a line in Sublime Text
\h* - optional horizontal whitespace (if you need to match newlines, use \s*)
USE - a literal case-sensitive character sequence USE
(?!\h*\[{0,2}[^]\s]*(?:\$|(?i:master))) - a negative lookahead that makes sure the USE is NOT followed with:
\h* - zero or more horizontal whitespace
\[{0,2} - zero, one or two [ brackets
[^]\s]* - zero or more characters other than ] and whitespace
(?:\$|(?i:master)) - either a $ or a case-insensitive master (we turn off case sensitivity with (?i:...) construct)
\h* - go on matching zero or more horizontal whitespace
\[{0,2} - zero, one or two [ brackets
[^]\s]* - zero or more characters other than ] and whitespace (when ] is the first character in a character class, it does not have to be escaped in Boost/PCRE regexps)
]{0,2} - zero, one or two ] brackets (outside of character class, the closing square bracket does not need escaping)
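The same logic can be tried outside Sublime Text with a sketch in Python's re module. Python has no \h, so [ \t] stands in for horizontal whitespace here, and the ] characters are escaped for readability; both are adaptations, not part of the Boost pattern above. The helper name is_flagged is hypothetical.

```python
import re

USE_STMT = re.compile(
    r"^[ \t]*USE"
    # Reject when '$' or "master" (any case) appears in the database name.
    r"(?![ \t]*\[{0,2}[^\]\s]*(?:\$|(?i:master)))"
    r"[ \t]*\[{0,2}[^\]\s]*\]{0,2}",
    re.MULTILINE,
)

def is_flagged(line):
    """True if the line is a USE statement that the pattern should match."""
    return USE_STMT.search(line) is not None

for stmt in [
    "USE [master]",
    "USE [$(CompiledDatabaseName)]",
    "USE [anything_else.01234]",
    "USE master",
    "USE anything_else.01234",
]:
    print(stmt, "->", is_flagged(stmt))
# USE [master] -> False
# USE [$(CompiledDatabaseName)] -> False
# USE [anything_else.01234] -> True
# USE master -> False
# USE anything_else.01234 -> True
```

The scoped flag (?i:...) needs Python 3.6 or later.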
