Regex with required text matches in the string - regex

I'm trying to extract the SMTP relay from the sendmail logs, and to make it reliable I need to require multiple strings in the log entry. An example of a log file entry would be like this:
2018-02-20T19:35:35+00:00 mx01.example.org sendmail[12345]: v1k82343VJ8K: to=<user#foobar.com>, delay=00:00:01, xdelay=00:00:01, mailer=esmtp, tls_verify=OK, relay=mailserver1.foobar.com. [1.1.1.1], dsn=2.0.0, stat=Sent
I can't just key in on "relay=" because the particular relay name I need only appears in the log entry line that contains "to=" with it.
How do I write my regex so that:
The words "sendmail", followed by "to=", then followed by "relay=" all appear in the same log entry.
After "relay=" I match any letter, digit, and character until the comma.
The end result should be:
mailserver1.foobar.com. [1.1.1.1]

See regex in use here
^.*\bsendmail\b.*\bto=.*relay=\K[^,]*
^ Assert position at the start of the line
.* Match any character any number of times
\b Assert position as a word boundary
sendmail Match this literally
\b Assert position as a word boundary
.* Match any character any number of times
\b Assert position as a word boundary
to= Match this literally
.* Match any character any number of times
relay= Match this literally
\K Resets the starting point of the match. Any previously consumed characters are no longer included in the final match
[^,]* Match any character except , any number of times
Result: mailserver1.foobar.com. [1.1.1.1]
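A minimal Python sketch of the same idea. Python's built-in re module does not support \K, so this version assumes a capturing group around [^,]* instead and reads group 1. The function name extract_relay and the second sample line are hypothetical, added only for illustration.

```python
import re

# Same pattern as above, with \K replaced by a capturing group,
# because Python's re module has no \K.
RELAY_RE = re.compile(r'^.*\bsendmail\b.*\bto=.*relay=([^,]*)')

LOG_LINE = (
    "2018-02-20T19:35:35+00:00 mx01.example.org sendmail[12345]: v1k82343VJ8K: "
    "to=<user#foobar.com>, delay=00:00:01, xdelay=00:00:01, mailer=esmtp, "
    "tls_verify=OK, relay=mailserver1.foobar.com. [1.1.1.1], dsn=2.0.0, stat=Sent"
)

# Hypothetical entry that has "relay=" but no "to=", so it must not match.
FROM_LINE = (
    "2018-02-20T19:35:34+00:00 mx01.example.org sendmail[12345]: v1k82343VJ8K: "
    "from=<someone#example.org>, size=1234, relay=inbound.example.org [2.2.2.2]"
)


def extract_relay(line):
    """Return the text after 'relay=' up to the next comma, or None."""
    match = RELAY_RE.search(line)
    return match.group(1) if match else None


relay = extract_relay(LOG_LINE)
no_relay = extract_relay(FROM_LINE)
print(relay)
print(no_relay)
```

Entries that lack "to=" return None, which is the filtering the question asks for.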

Related

Regex Help required for User-Agent Matching

Have used an online regex learning site (regexr) and created something that works but with my very limited experience with regex creation, I could do with some help/advice.
In IIS10 logs, there is a list for time, date... but I am only interested in the cs(User-Agent) field.
My Regex:
(scan\-\d+)(?:\w)+\.shadowserver\.org
which matches these:
scan-02.shadowserver.org
scan-15n.shadowserver.org
scan-42o.shadowserver.org
scan-42j.shadowserver.org
scan-42b.shadowserver.org
scan-47m.shadowserver.org
scan-47a.shadowserver.org
scan-47c.shadowserver.org
scan-42a.shadowserver.org
scan-42n.shadowserver.org
scan-42o.shadowserver.org
but what I would like it to do is:
Match a single number with the option of capturing more than one: scan-2 or scan-02 with an optional letter: scan-2j or scan-02f
Append the rest of the User Agent: .shadowserver.org to the regex.
I will then add it to an existing URL Rewrite rule (as a condition) to abort the request.
Any advice/help would be very much appreciated
Tried:
To write a regex for IIS10 to block requests from a certain user-agent
Expected:
It to work on single numbers as well as double/triple numbers with or without a letter.
(scan\-\d+)(?:\w)+\.shadowserver\.org
Input Text:
scan-2.shadowserver.org
scan-02.shadowserver.org
scan-2j.shadowserver.org
scan-02j.shadowserver.org
scan-17w.shadowserver.org
scan-101p.shadowserver.org
UPDATE:
I eventually came up with this:
scan\-[0-9]+[a-z]{0,1}\.shadowserver\.org
This is an explanation of your regex pattern. If you only want the solution, go directly to the end.
(scan\-\d+)(?:\w)+
(scan\-\d+) Group 1: matches the word scan followed by a literal -. You escaped the hyphen with a \, but outside a character class an unescaped - is also a literal hyphen, so you don't have to escape it here. The - is followed by \d+, which means one or more digits from 0-9 (there must be at least one digit). The matched value is saved in the first capturing group.
(?:\w)+ A non-capturing group. \w is one word character, equal to [A-Za-z0-9_], and the plus sign + after the group means "match the whole group one or more times". Since the group contains only \w, it matches one or more word characters. Note that the non-capturing group is redundant here; \w+ can be used directly.
Taking two examples:
The first example: scan-02.shadowserver.org
(scan\-\d+)(?:\w)+
scan matches the word scan in scan-02, and \- matches the hyphen after it. \d+ (one or more digits) first matches 02, so the value so far is scan-02. Then (?:\w)+ must match at least one word character; it tries to match the period . and fails, because the period is not a word character. Is it over at this point? No: the regex engine backtracks into the previous \d+, which this time matches only the 0 in scan-02, so scan-0 is saved in the first capturing group, and (?:\w)+ then matches the 2. Why does the engine backtrack into \d+? Because both \d+ and (?:\w)+ use +, which requires at least one digit and at least one word character respectively, and the engine tries every way of satisfying both.
The second example: scan-2.shadowserver.org
(scan\-\d+)(?:\w)+
(scan\-\d+) matches scan-2, and (?:\w)+ then tries to match the period after scan-2 but fails. This is the important point: \d+ holds only one digit and cannot give any back, so the engine returns to the string scan-2.shadowserver.org and tries (scan\-\d+) again starting from the character c. The s in the pattern fails to match c, the engine keeps trying at each later position, and in the end the match fails.
Simple solution:
(scan-\d+[a-z]?)\.shadowserver\.org
Explanation
(scan-\d+[a-z]?) Group 1: captures the word scan, followed by a literal -, followed by \d+ (one or more digits), followed by an optional lowercase letter [a-z]?. The ? makes the [a-z] part optional; without it, [a-z] would require exactly one lowercase letter.
See regex demo
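The backtracking behaviour described above can be checked with a short Python sketch. It runs both patterns over the question's sample input; Python is used only as a convenient test bench here, and IIS URL Rewrite may differ in flavor details. The names ORIGINAL, FIXED and SAMPLES are illustrative.

```python
import re

# The pattern from the question and the simplified pattern from this answer.
ORIGINAL = re.compile(r'(scan\-\d+)(?:\w)+\.shadowserver\.org')
FIXED = re.compile(r'(scan-\d+[a-z]?)\.shadowserver\.org')

SAMPLES = [
    "scan-2.shadowserver.org",
    "scan-02.shadowserver.org",
    "scan-2j.shadowserver.org",
    "scan-02j.shadowserver.org",
    "scan-17w.shadowserver.org",
    "scan-101p.shadowserver.org",
]

# Group 1 of the original pattern (None when it does not match at all).
original_groups = {
    s: (m.group(1) if (m := ORIGINAL.search(s)) else None) for s in SAMPLES
}
# Group 1 of the fixed pattern.
fixed_groups = {
    s: (m.group(1) if (m := FIXED.search(s)) else None) for s in SAMPLES
}

for s in SAMPLES:
    print(f"{s:30} original={original_groups[s]!r:12} fixed={fixed_groups[s]!r}")
```

The original pattern finds nothing for scan-2 and captures only scan-0 for scan-02, while the fixed pattern captures the full host prefix in every case.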

Regex matching multiple groups

I am very new to Regex and trying to create filter rule to get some matches. For Instance, I have query result like this:
application_outbound_api_external_metrics_service_plus_success_total
application_outbound_api_external_metrics_service_plus_failure_total
application_inbound_api_metrics_service_success_total
application_inbound_api_metrics_service_failure_total
Now I want to filter ONLY lines which contains "outbound" AND "service_plus" AND "failure".
I tried to play with groups, but somewhere I am misunderstanding how to build the regex, and it returns wrong results.
Regex which I used:
/(?:outbound)|(?:service_plus)|(?:failure)/
You should use multiple lookahead assertions:
^(?=.*outbound)(?=.*service_plus)(?=.*failure).*\n?
The above should use the MULTILINE flag so that ^ is interpreted as start of string or start of line.
^ - matches start of string or start of line.
(?=.*outbound) - asserts that at the current position we can match 0 or more non-newline characters followed by 'outbound' without consuming any characters (i.e. the scan position is not advanced)
(?=.*service_plus) - asserts that at the current position we can match 0 or more non-newline characters followed by 'service_plus' without consuming any characters (i.e. the scan position is not advanced)
(?=.*failure) - asserts that at the current position we can match 0 or more non-newline characters followed by 'failure' without consuming any characters (i.e. the scan position is not advanced)
.*\n? - matches 0 or more non-newline characters optionally followed by a newline (in case the final line does not terminate in a newline character)
See RegEx Demo
In Python, for example:
import re
lines = """application_outbound_api_external_metrics_service_plus_success_total
application_outbound_api_external_metrics_service_plus_failure_total
application_inbound_api_metrics_service_success_total
application_inbound_api_metrics_service_failure_total
failureoutboundservice_plus"""
rex = re.compile(r'^(?=.*outbound)(?=.*service_plus)(?=.*failure).*\n?', re.M)
filtered_lines = ''.join(rex.findall(lines))
print(filtered_lines)
Prints:
application_outbound_api_external_metrics_service_plus_failure_total
failureoutboundservice_plus
You need to make use of lookaheads to assert that multiple things need to exist regardless of the order they exist:
^(?=.*(?:^|_)outbound(?:_|$))(?=.*(?:^|_)service_plus(?:_|$))(?=.*(?:^|_)failure(?:_|$)).+$
^ - start line anchor
(?= - open the positive lookahead aka "ahead of me is..."
.* - optionally anything
(?:^|_) - start line anchor or underscore
outbound - the word "outbound"
(?:_|$) - underscore or end line anchor
The underscores and line anchors ensure we don't have false positives like "outbounds" or "goutbound"
) - close the positive lookahead
Rinse and repeat for "service_plus" and "failure"
Since lookaheads don't consume any characters, the second and third lookaheads start from the same position as the first, which allows the terms to appear in any order
.+$ - capture everything till the end of the line
https://regex101.com/r/Zhl4Mf/1
If the order does matter then build a regex in the correct order:
^.*_outbound_.*_service_plus_failure_.*$
https://regex101.com/r/b7O5YK/1
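A small Python sketch comparing the two patterns from this answer on the question's lines. The last two input lines are made-up test cases (one without underscore delimiters, one with the terms in reverse order), and the variable names are illustrative.

```python
import re

# Any order, with each term delimited by underscores or line anchors.
ANY_ORDER = re.compile(
    r'^(?=.*(?:^|_)outbound(?:_|$))'
    r'(?=.*(?:^|_)service_plus(?:_|$))'
    r'(?=.*(?:^|_)failure(?:_|$)).+$'
)
# Fixed order: outbound first, then service_plus_failure.
FIXED_ORDER = re.compile(r'^.*_outbound_.*_service_plus_failure_.*$')

lines = [
    "application_outbound_api_external_metrics_service_plus_success_total",
    "application_outbound_api_external_metrics_service_plus_failure_total",
    "application_inbound_api_metrics_service_success_total",
    "application_inbound_api_metrics_service_failure_total",
    "failureoutboundservice_plus",    # hypothetical: terms not delimited
    "failure_service_plus_outbound",  # hypothetical: terms in reverse order
]

# Each line is tested on its own, so no MULTILINE flag is needed.
any_order_hits = [line for line in lines if ANY_ORDER.search(line)]
fixed_order_hits = [line for line in lines if FIXED_ORDER.search(line)]

print(any_order_hits)
print(fixed_order_hits)
```

The delimited any-order pattern rejects failureoutboundservice_plus but accepts the reversed line; the fixed-order pattern accepts only the one line from the question.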

PCRE Regex: Is it possible to check within only the first X characters of a string for a match

PCRE Regex: Is it possible for Regex to check for a pattern match within only the first X characters of a string, ignoring other parts of the string beyond that point?
My Regex:
I have a Regex:
/\S+V\s*/
This checks the string for non-whitespace characters which have a trailing 'V' and then a whitespace character or the end of the string.
This works. For example:
Example A:
SEBSTI FMDE OPORV AWEN STEM students into STEM
// Match found in 'OPORV' (correct)
Example B:
ARKFE SSETE BLMI EDSF BRNT CARFR (name removed) Academy Networking Event
//Match not found (correct).
Re the capitalised text: each letter and its placement has a meaning in the source data. This is followed by generic info for humans to read ("Academy Networking Event", etc.)
My Issue:
It can theoretically occur that sometimes there are names that involve roman numerals such as:
Example C:
ARKFE SSETE BLME CARFR Academy IV Networking Event
//Match found (incorrect).
I would like my Regex above to only check the first X characters of the string.
Can this be done in PCRE Regex itself? I can't find any reference to length counting in Regex and I suspect this can't easily be achieved. String lengths are completely arbitrary. (We have no control over the source data).
Intention:
/\S+V\s*/{check within first 25 characters only}
ARKFE SSETE BLME CARFR Academy IV Networking Event
^
\- Cut off point. Not found so far so stop.
//Match not found (correct).
Workaround:
The Regex is in PHP and my current solution is to cut the string in PHP, to only check the first X characters, typically the first 20 characters, but I was curious if there was a way of doing this within the Regex without needing to manipulate the string directly in PHP?
$valueSubstring = substr($coreRow['value'],0,20); /* first 20 characters only */
$virtualCount = preg_match_all('/\S+V\s*/',$valueSubstring);
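For comparison only, and not as a PCRE or PHP answer: in Python's re module a compiled pattern's search method accepts pos and endpos arguments, which gives the same effect as the substr workaround without copying the string. The sketch below assumes a cut-off of 25 characters, and the function name has_v_word is hypothetical.

```python
import re

PATTERN = re.compile(r'\S+V\s*')

EXAMPLE_A = "SEBSTI FMDE OPORV AWEN STEM students into STEM"
EXAMPLE_C = "ARKFE SSETE BLME CARFR Academy IV Networking Event"


def has_v_word(value, limit=25):
    """Search only value[0:limit]; endpos makes the pattern see the
    string as if it ended at index `limit`."""
    return PATTERN.search(value, 0, limit) is not None


found_a = has_v_word(EXAMPLE_A)                       # 'OPORV' is inside the limit
found_c = has_v_word(EXAMPLE_C)                       # 'IV' lies beyond the limit
found_c_full = has_v_word(EXAMPLE_C, len(EXAMPLE_C))  # whole string searched
print(found_a, found_c, found_c_full)
```

As with substr, a word that straddles the cut-off is truncated before matching.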
The trick is to capture the end of the line after the first 25 characters in a lookahead and to check if it follows the eventual match of your subpattern:
$pattern = '~^(?=.{0,25}(.*)).*?\K\S+V\b(?=.*\1)~m';
demo
details:
^ # start of the line
(?= # open a lookahead assertion
.{0,25} # up to the first 25 characters
(.*) # capture the end of the line
) # close the lookahead
.*? # consume lazily the characters
\K # the match result starts here
\S+V # your pattern
\b # a word boundary (that matches between a letter and a white-space
# or the end of the string)
(?=.*\1) # check that the end of the line follows with a reference to
# the capture group 1 content.
Note that you can also write the pattern in a more readable way like this:
$pattern = '~^
(*positive_lookahead: .{0,20} (?<line_end> .* ) )
.*? \K \S+ V \b
(*positive_lookahead: .*? \g{line_end} ) ~xm';
(The alternative syntax (*positive_lookahead: ...) is available since PHP 7.3)
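The capture-and-backreference trick can also be tried in Python as a rough sketch. Python's re module has no \K, so the wanted text is assumed to be read from a second capturing group instead; the lookahead still fixes group 1 to everything after the first 25 characters. The function name find_v_word is hypothetical.

```python
import re

# Group 1: the tail of the line after (at most) the first 25 characters.
# Group 2: the word ending in V (stands in for what \K would expose).
# The final lookahead requires the tail to still follow the matched word,
# which is only possible when the word ends inside the first 25 characters.
PATTERN = re.compile(r'^(?=.{0,25}(.*)).*?(\S+V)\b(?=.*\1)')

EXAMPLE_A = "SEBSTI FMDE OPORV AWEN STEM students into STEM"
EXAMPLE_C = "ARKFE SSETE BLME CARFR Academy IV Networking Event"


def find_v_word(line):
    """Return the first word ending in V within the first 25 chars, or None."""
    match = PATTERN.search(line)
    return match.group(2) if match else None


word_a = find_v_word(EXAMPLE_A)
word_c = find_v_word(EXAMPLE_C)
print(word_a, word_c)
```

In Example C the word IV is found by \S+V, but the captured tail no longer fits in what follows it, so the whole match fails.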
You can find your pattern after X chars and skip the whole string, else, match your pattern. So, if X=25:
^.{25,}\S+V.*(*SKIP)(*F)|\S+V\s*
See the regex demo. Details:
^.{25,}\S+V.*(*SKIP)(*F) - start of string, 25 or more chars other than line break chars, as many as possible, then one or more non-whitespaces and V, and then the rest of the string, the match is failed and skipped
| - or
\S+V\s* - match one or more non-whitespaces, V and zero or more whitespace chars.
Any V ending in the first 25 positions
^.{1,24}V\s
See regex
Any word ending in V in the first 25 positions
^.{1,23}[A-Z]V\s

Regex for alphanumeric word and should not be like RUN123456

I want to apply regex on a string to get alphanumeric value and the value should not start with the RUN substring followed with any digit, e.g. RUN123456.
Below is the regex I am using to get alphanumeric value
regex='[A-Z]{2,}[_0-9a-zA-Z]*'
Sample Input:
CY0PNI94980 Production AutoSys Job has failed. Call 249-3344. EC=54. RUN130990.
The matches can include CY0PNI94980 and EC, but not RUN130990.
Kindly help me on this.
You may match the strings matching your pattern excluding all those starting with RUN and a digit:
\b(?!RUN[0-9])[A-Z]{2,}[_0-9a-zA-Z]*
See the regex demo
If you do not care whether Unicode letters or digits are matched too, you may replace [A-Za-z0-9_] with \w and use
\b(?!RUN[0-9])[A-Z]{2,}\w*
Details
\b - a word boundary
(?!RUN[0-9]) - a negative lookahead that fails the match if there is RUN and any ASCII digit immediately to the right of the current location
[A-Z]{2,} - 2 or more uppercase ASCII letters
[_0-9a-zA-Z]* / \w* - 0 or more word chars (letters/digits/_).
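A quick check of the first pattern in Python, using the sample input from the question. This is only a sketch; the variable names are illustrative.

```python
import re

# Word boundary, then a negative lookahead that rejects RUN + digit,
# then the original "two or more capitals, then word chars" pattern.
PATTERN = re.compile(r'\b(?!RUN[0-9])[A-Z]{2,}[_0-9a-zA-Z]*')

sample = ("CY0PNI94980 Production AutoSys Job has failed. "
          "Call 249-3344. EC=54. RUN130990.")

matches = PATTERN.findall(sample)
print(matches)
```

The \b matters here: without it the engine could start inside RUN130990 (at UN130990) and still produce a match.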

Hive RegexSerDe Multiline Log matching

I am looking for a regex that can be fed to a "create external table" statement of Hive QL in the form of
"input.regex"="the regex goes here"
The condition is that the logs in the files that the RegexSerDe must be reading are of the following form:
2013-02-12 12:03:22,323 [DEBUG] 2636hd3e-432g-dfg3-dwq3-y4dsfq3ew91b Some message that can contain any special character, including linebreaks. This one does not have a linebreak. It just has spaces on the same line.
2013-02-12 12:03:24,527 [DEBUG] 265y7d3e-432g-dfg3-dwq3-y4dsfq3ew91b Some other message that can contain any special character, including linebreaks. This one does not have one either. It just has spaces on the same line.
2013-02-12 12:03:24,946 [ERROR] 261rtd3e-432g-dfg3-dwq3-y4dsfq3ew91b Some message that can contain any special character, including linebreaks.
This is a special one.
This has a message that is multi-lined.
This is line number 4 of the same log.
Line 5.
2013-02-12 12:03:24,988 [INFO] 2632323e-432g-dfg3-dwq3-y4dsfq3ew91b Another 1-line log
2013-02-12 12:03:25,121 [DEBUG] 263tgd3e-432g-dfg3-dwq3-y4dsfq3ew91b Yet another one line log.
I am using the following create external table code:
CREATE EXTERNAL TABLE applogs (logdatetime STRING, logtype STRING, requestid STRING, verbosedata STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES
(
"input.regex" = "(\\A[[0-9:-] ]{19},[0-9]{3}) (\\[[A-Z]*\\]) ([0-9a-z-]*) (.*)?(?=(?:\\A[[0-9:-] ]{19},[0-9]|\\z))",
"output.format.string" = "%1$s \\[%2$s\\] %3$s %4$s"
)
STORED AS TEXTFILE
LOCATION 'hdfs:///logs-application';
Here's the thing:
It is able to pull all the FIRST LINES of each log, but not the other lines of logs that have more than one line. I tried all links, replaced \z with \Z at the end, replaced \A with ^ and \Z or \z with $; nothing worked. Am I missing something in the output.format.string's %4$s? Or am I not using the regex properly?
What the regex does:
It matches the timestamp first, followed by the log type (DEBUG or INFO or whatever), then the ID (mix of lower case alphabets, numbers and hyphens) followed by ANYTHING, till the next timestamp is found, or till the end of input is found to match the last log entry. I also tried adding the /m at the end, in which case, the table generated has all NULL values.
There seem to be a number of issues with your regex.
First, remove your double square brackets.
Second, \A and \Z/\z are to match the beginning and end of the input, not just a line. Change \A to ^ to match start-of-line but don't change \z to $ as you do actually want to match end-of-input in this case.
Third, you want to match (.*?), not (.*)?. The first pattern is ungreedy, whereas the second pattern is greedy but optional. It should have matched your entire input to the end as you allowed it to be followed by end-of-input.
Fourth, . does not match newlines. You can use (\s|\S) instead, or ([x]|[^x]), etc., any pair of complementary matches.
Fifth, if it was giving you single line matches with \A and \Z/\z then the input was single lines also as you were anchoring the whole string.
I would suggest trying to match just \n, if nothing matches then newlines are not included.
You can't add /m to the end as the regex does not include delimiters. It will try to match the literal characters /m instead which is why you got no match.
If it was going to work the regex you want would be:
"^([0-9: -]{19},[0-9]{3}) (\\[[A-Z]*\\]) ([0-9a-z-]*) ([\\s\\S]*?)(?=\\r?\\n[0-9: -]{19},[0-9]|\\r?\\Z)"
Breakdown:
^([0-9: -]{19},[0-9]{3})
Match the start of a line, then 19 characters that are digits, :, space or -, plus a comma, three digits and a space. Capture all but the final space (the timestamp). The - is placed last in the character class so that it is read as a literal hyphen rather than as a range.
(\\[[A-Z]*\\])
Match a literal [, any number of UPPERCASE letters, even none, a literal ] and a space. Capture all but the final space (the error level).
([0-9a-z-]*)
Match any number of digits, lowercase letters or - and a space. Capture all but the final space (the message id).
([\\s\\S]*?)(?=\\r?\\n[0-9: -]{19},[0-9]|\\r?\\Z)
Match any whitespace or non-whitespace character (that is, any character), but lazily (*?). Stop matching when a new record or the end of input (\Z) is immediately ahead. In this case you don't want to match end of line as, once again, you would only get one line in your output. Capture all but the final newline (the message text). The \r?\n is to skip the final newline at the end of your message, as is the \r?\Z. You could also write \r?\n\z. Note: capital \Z also matches just before a final newline at the end of the input if there is one. Lowercase \z matches only at the very end of the input, not before a trailing newline. I have added \r? just in case you have to deal with Windows line endings; however, I don't believe this should be necessary.
However, I suspect that unless you can feed the whole file in at once instead of line-by-line that this will not work either.
Another simple test you can try is:
"^([\\s\\S]+)^\\d"
If it works, it will match any full line followed by a digit at the start of the next line (the first digit of your timestamp).
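The lazy-match-plus-lookahead idea can be tried outside Hive with a short Python sketch. This says nothing about how RegexSerDe feeds input to the pattern; it assumes the whole log text is available as one string, uses Python's re syntax rather than Java's, and the names RECORD_RE and log_text are illustrative.

```python
import re

# One record = timestamp, [LEVEL], request id, then a lazily matched message
# that stops just before the next timestamp line or at the end of the text.
RECORD_RE = re.compile(
    r'^(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d,\d{3}) '   # 1: timestamp
    r'\[([A-Z]+)\] '                              # 2: log level
    r'([0-9a-z-]+) '                              # 3: request id
    r'([\s\S]*?)'                                 # 4: message, may span lines
    r'(?=\r?\n\d{4}-\d\d-\d\d \d\d:\d\d:\d\d,|\r?\n?\Z)',
    re.MULTILINE,  # let ^ match at the start of every line
)

log_text = (
    "2013-02-12 12:03:24,946 [ERROR] 261rtd3e-432g-dfg3-dwq3-y4dsfq3ew91b Some message.\n"
    "This is a special one.\n"
    "Line 5.\n"
    "2013-02-12 12:03:24,988 [INFO] 2632323e-432g-dfg3-dwq3-y4dsfq3ew91b Another 1-line log\n"
)

# findall returns one tuple of the four groups per record.
records = RECORD_RE.findall(log_text)
for timestamp, level, request_id, message in records:
    print(level, repr(message))
```

If the same text is fed in one line at a time instead, the continuation lines never reach the pattern together with their first line, which is the failure mode described above.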
Following Java regex may help:
(\d{4}-\d{1,2}-\d{1,2}\s+\d{1,2}:\d{1,2}:\d{1,2},\d{1,3})\s+(\[.+?\])\s+(.+?)\s+([\s\S]+?)(?=\d{4}-\d{1,2}-\d{1,2}|\Z)
Breakdown:
1st Capturing group (\d{4}-\d{1,2}-\d{1,2}\s+\d{1,2}:\d{1,2}:\d{1,2},\d{1,3})
2nd Capturing group (\[.+?\])
3rd Capturing group (.+?)
4th Capturing group ([\s\S]+?).
(?=\d{4}-\d{1,2}-\d{1,2}|\Z) Positive Lookahead - Assert that the regex below can be matched. 1st Alternative: \d{4}-\d{1,2}-\d{1,2}. 2nd Alternative: \Z assert position at end of the string.
Reference http://regex101.com/
I don't know much about Hive, but the following regex, or a variation formatted for Java strings, might work:
(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d,\d+) \[([a-zA-Z_-]+)\] ([\w-]+) ((?:[^\n\r]+)(?:[\n\r]{1,2}\s[^\n\r]+)*)
This can be seen matching your sample data here:
http://rubular.com/r/tQp9iBp4JI
A breakdown:
(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d,\d+) The date and time (capture group 1)
\[([a-zA-Z_-]+)\] The log level (capture group 2)
([\w-]+) The request id (capture group 3)
((?:[^\n\r]+)(?:[\n\r]{1,2}\s[^\n\r]+)*) The potentially multi-line message (capture group 4)
The first three capture groups are pretty simple.
The last one is a little odd, but it's working on rubular. A breakdown:
( Capture it as one group
(?:[^\n\r]+) Match to the end of the line, don't capture
(?: Match line by line, after the first, but don't capture
[\n\r]{1,2} Match the new-line
\s Only lines starting with a space (this prevents new log-entries from matching)
[^\n\r]+ Match to the end of the line
)* Match zero or more of these extra lines
)
I used [^\n\r] instead of the . because it looks like RegexSerDe lets the . match new lines (link):
// Excerpt from https://github.com/apache/hive/blob/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java#L101
if (inputRegex != null) {
inputPattern = Pattern.compile(inputRegex, Pattern.DOTALL
+ (inputRegexIgnoreCase ? Pattern.CASE_INSENSITIVE : 0));
} else {
inputPattern = null;
}
Hope this helps.